# Visualizing Hierarchies

**The dendrogram of a hierarchical clustering**

In [None]:
"""

Hierarchical clustering proceeds in steps. In the beginning, every country is its own cluster - so there are as many clusters as there are countries!
At each step, the two closest clusters are merged. This decreases the number of clusters, and eventually, there is only one cluster left,
and it contains all the countries. This bottom-up process is a particular type of hierarchical clustering called "agglomerative clustering".



The entire process of the hierarchical clustering is encoded in the dendrogram. At the bottom, each country is in a cluster of its own.
The clustering then proceeds from the bottom up. Clusters are represented as vertical lines, and a joining of vertical lines indicates a merging of clusters.

"""

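The step-by-step merging described above can be inspected directly in the array that linkage() returns. The following is a minimal sketch on made-up toy data (the five points are an assumption for illustration, not the country data): each row of the result records one merge, so there is one fewer row than there are samples, and the last merge contains every sample.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: five samples with two features each
samples = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [9.0, 0.0],
])

# Each row records one merge:
# [index of cluster a, index of cluster b, merge distance, size of the new cluster]
mergings = linkage(samples, method='complete')

n_samples = samples.shape[0]
n_merges = mergings.shape[0]        # one merge per step
final_size = int(mergings[-1, 3])   # size of the cluster formed by the last merge

print(mergings)
print(n_merges, final_size)
```

The third column of each row is the height at which that merge is drawn in the dendrogram.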
In [None]:
"""

Import:
linkage and dendrogram from scipy.cluster.hierarchy.
matplotlib.pyplot as plt.


Perform hierarchical clustering on samples using the linkage() function with the method='complete' keyword argument. Assign the result to mergings.
Plot a dendrogram using the dendrogram() function on mergings. Specify the keyword arguments labels=varieties, leaf_rotation=90, and leaf_font_size=6.

"""


# Perform the necessary imports
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Calculate the linkage: mergings
mergings = linkage(samples, method='complete')

# Plot the dendrogram, using varieties as labels
dendrogram(mergings,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()


In [None]:
### Hierarchies of stocks

"""

In chapter 1, we used k-means clustering to cluster companies according to their stock price movements. Now, we'll perform hierarchical clustering of the companies.
We are given a NumPy array movements of price movements, where the rows correspond to companies, and a list companies of the company names.
SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so we'll need to use the normalize() function from sklearn.preprocessing instead of Normalizer.

"""



"""

Import normalize from sklearn.preprocessing.
Rescale the price movements for each stock by using the normalize() function on movements.
Apply the linkage() function to normalized_movements, using 'complete' linkage, to calculate the hierarchical clustering. Assign the result to mergings.
Plot a dendrogram of the hierarchical clustering, using the list companies of company names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments.

"""


# Import normalize
from sklearn.preprocessing import normalize

# Normalize the movements: normalized_movements
normalized_movements = normalize(movements)

# Calculate the linkage: mergings
mergings = linkage(normalized_movements, method='complete')

# Plot the dendrogram
dendrogram(mergings,
           labels=companies,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()
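
To see what normalize() does to the price movements before clustering, here is a small sketch on made-up numbers (the three rows are hypothetical, not real stock data). By default, normalize() rescales each row to unit L2 norm, so two companies whose prices move in the same pattern but at different magnitudes end up with identical rows.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Hypothetical price movements: rows are companies, columns are days
movements = np.array([
    [1.0, -2.0, 2.0],
    [10.0, -20.0, 20.0],   # same pattern as the first row, ten times larger
    [0.5, 0.5, -1.0],
])

# Rescale each row (company) independently to unit L2 norm
normalized_movements = normalize(movements)

# Every row now has length 1
row_norms = np.linalg.norm(normalized_movements, axis=1)
print(row_norms)
print(normalized_movements)
```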

# Cluster labels in Hierarchical Clustering

**Intermediate Clusterings & Height**

In [None]:
"""

1) What is an Intermediate Clustering?
----------------------------------------
At different points in the process, we can stop merging and take the clusters as they are.


2) What Does the Height in a Dendrogram Mean?
----------------------------------------------
The y-axis of the dendrogram shows the distance between clusters when they merge.

If two clusters merge at height 6, it means they were 6 units apart.

The bigger the height, the less similar the clusters were before merging.


3) How to Use This in Practice?
-------------------------------------
We can choose a height on the dendrogram to decide how many clusters we want.

If we stop merging at a lower height, we get many, small clusters.
If we stop merging at a higher height, we get fewer, larger clusters.



"""

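The effect of the chosen height can be checked with fcluster(). This is a minimal sketch on made-up one-dimensional data (the six values and the four heights are assumptions for illustration): cutting low leaves many small clusters, and cutting high leaves fewer, larger ones.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D data with three visible groups
samples = np.array([[0.0], [1.0], [10.0], [11.0], [30.0], [31.0]])
mergings = linkage(samples, method='complete')

# Stop merging at several heights and count the clusters that remain
clusters_at_height = {}
for height in (0.5, 5, 15, 40):
    labels = fcluster(mergings, height, criterion='distance')
    clusters_at_height[height] = len(np.unique(labels))

print(clusters_at_height)
```

For this data the counts go from 6 clusters at height 0.5, to 3 at height 5, to 2 at height 15, to 1 at height 40.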
**Distance Between Clusters**

In [None]:
"""


1) How Do We Measure the Distance Between Clusters?
--------------------------------------------------------
When grouping things together in hierarchical clustering, we need to decide how to measure the distance between two clusters. This is called the "linkage method".



"Complete linkage" means:
-------------------------------
The distance between two clusters is the largest distance between any two points (one from each cluster).
Two clusters therefore merge only when even their farthest points are close, which tends to keep clusters compact.


Example:

Suppose students are split by height into two groups (clusters):

Cluster 1: A (160 cm), B (170 cm)
Cluster 2: C (180 cm), D (190 cm)

Now, we want to measure the distance between these two clusters.

Step 1: Find All Possible Distances
We check the height differences between every student from Cluster 1 and every student from Cluster 2:

A (160 cm) vs. C (180 cm) → 20 cm
A (160 cm) vs. D (190 cm) → 30 cm
B (170 cm) vs. C (180 cm) → 10 cm
B (170 cm) vs. D (190 cm) → 20 cm

Step 2: Choose the Maximum Distance
The largest difference is 30 cm, so the distance between these two clusters is 30 cm.


Some other linkages:
------------------------

Single linkage: Uses the smallest distance between two points.
Average linkage: Uses the average distance between all pairs of points.
Centroid linkage: Uses the distance between cluster centers.


"""

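The height example above can be reproduced by hand with NumPy. This sketch computes the pairwise differences between the two clusters and then applies each linkage rule to them; the variable names are chosen here for illustration.

```python
import numpy as np

# Heights in cm, taken from the example above
cluster_1 = np.array([160, 170])   # A, B
cluster_2 = np.array([180, 190])   # C, D

# All pairwise differences: rows are A, B and columns are C, D
pairwise = np.abs(cluster_1[:, None] - cluster_2[None, :])

complete_distance = pairwise.max()    # largest pairwise distance
single_distance = pairwise.min()      # smallest pairwise distance
average_distance = pairwise.mean()    # mean over all pairs
centroid_distance = abs(cluster_1.mean() - cluster_2.mean())  # distance between cluster means

print(pairwise)
print(complete_distance, single_distance, average_distance, centroid_distance)
```

Complete linkage gives 30 cm as in the worked example, single linkage gives 10 cm, and both average and centroid linkage give 20 cm for these particular values.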
In [None]:
"""

Import linkage and dendrogram from scipy.cluster.hierarchy.
Perform hierarchical clustering on samples using the linkage() function with the method='single' keyword argument. Assign the result to mergings.
Plot a dendrogram of the hierarchical clustering, using the list country_names as the labels. In addition, specify the leaf_rotation=90, and leaf_font_size=6 keyword arguments as you have done earlier.

"""


# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Calculate the linkage: mergings
mergings = linkage(samples, method='single')

# Plot the dendrogram
dendrogram(mergings,
           labels=country_names,
           leaf_rotation=90,
           leaf_font_size=6,
)
plt.show()

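The choice of linkage changes the heights in the dendrogram, not just its shape. As a small sketch, the four student heights from the earlier example (reused here as a hypothetical dataset) are clustered with both methods, and the height of the final merge is read from the third column of the last row of mergings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The four heights (cm) from the complete-linkage example, one feature per sample
heights = np.array([[160.0], [170.0], [180.0], [190.0]])

# Record the distance at which the last two clusters are merged
final_merge_height = {}
for method in ('single', 'complete'):
    mergings = linkage(heights, method=method)
    final_merge_height[method] = float(mergings[-1, 2])

print(final_merge_height)
```

With single linkage the last merge happens at 10 cm, because neighbouring students are always 10 cm apart. With complete linkage it happens at 30 cm, the largest difference between any two students.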

In [None]:
### Intermediate clusterings

"""

Displayed is the dendrogram for the hierarchical clustering of the grain samples that you computed earlier. If the hierarchical clustering were stopped
at height 6 on the dendrogram, how many clusters would there be?

"""


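from scipy.cluster.hierarchy import linkage, fcluster

# Build the hierarchy, then cut it at height 6
mergings = linkage(samples, method='complete')
labels = fcluster(mergings, 6, criterion='distance')
print(labels)

# The number of distinct labels is the number of clusters at that height
print(len(set(labels)))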

In [None]:
### Extracting the cluster labels

"""

Import:
pandas as pd.
fcluster from scipy.cluster.hierarchy.


Perform a flat hierarchical clustering by using the fcluster() function on mergings. Specify a maximum height of 6 and the keyword argument criterion='distance'.
Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
Create a cross-tabulation ct between df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label.

"""

# Perform the necessary imports
import pandas as pd
from scipy.cluster.hierarchy import fcluster

# Use fcluster to extract labels: labels
labels = fcluster(mergings, 6, criterion='distance')

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)


"""

varieties  Canadian wheat  Kama wheat  Rosa wheat
labels
1                      14           3           0
2                       0           0          14
3                       0          11           0

"""

# t-SNE for 2 dimensional maps

**t-SNE**

In [None]:
"""

t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique used to visualize high-dimensional data in 2D or 3D.

Consider a dataset where each sample has many features (e.g., 50 or more dimensions). We can't easily see patterns in such high dimensions.
t-SNE helps by reducing it to just 2 or 3 dimensions, so we can plot it and observe clusters or relationships.



The Iris dataset has flowers with 4 measurements:

Petal length
Petal width
Sepal length
Sepal width

Each flower also belongs to one of 3 species:
Setosa
Versicolor
Virginica

Applying t-SNE
----------------
t-SNE takes the 4D iris data and reduces it to 2D for visualization.

t-SNE is not given the species labels, only the measurements.
After plotting the reduced data in 2D, we see 3 distinct groups when we color them by species.


"""

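The iris walkthrough above can be sketched as follows, assuming the copy of the dataset that ships with scikit-learn (load_iris). The random_state value is an arbitrary choice made here so that repeated runs are comparable. The species labels are loaded only for coloring a plot afterwards and are never passed to t-SNE.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
samples = iris.data       # 150 flowers, 4 measurements each
species = iris.target     # species codes 0, 1, 2; not given to t-SNE

# Reduce the 4 measurements to 2 t-SNE features
model = TSNE(n_components=2, random_state=0)
tsne_features = model.fit_transform(samples)

print(samples.shape, tsne_features.shape)
```

A scatter plot of tsne_features[:, 0] against tsne_features[:, 1], with c=species, then shows how the species group in the 2D map.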
**t-SNE Learning Rate**

In [None]:
"""

1) Learning Rate in t-SNE
------------------------------
The learning rate is an important setting in t-SNE that affects how well it maps high-dimensional data to 2D.
If the learning rate is too low or too high, the 2D map can be distorted and hard to read.

Bad learning rate? We will see all points squeezed together in a messy way.
Good learning rate? The clusters will be more clearly separated.

Note: Usually, trying values between 50 and 200 is enough to get a good result.



2) t-SNE Gives Different Results Every Time
-----------------------------------------------
Unlike some other techniques, t-SNE does not always give the same plot, even if we run it on the same data.
The axes (X and Y) don’t have a fixed meaning.

However, which points end up grouped together stays largely the same.
For example, if we run t-SNE on wine varieties, we will see that the three types of wine stay in similar groups, but the whole plot might be rotated or flipped.


So, in t-SNE plots, there is no need to worry about the exact position of clusters; just focus on which points are grouped together.


"""

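One way to try a few learning rates, as suggested above, is a simple loop. This sketch assumes the iris data from scikit-learn as a stand-in dataset and fixes random_state (an arbitrary value) so that the runs differ only in the learning rate. It records kl_divergence_, the value of the objective that t-SNE minimizes, as a rough numeric check. Looking at the scatter plots is still the main way to judge the result.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

# Stand-in data for illustration only
samples = load_iris().data

# Fit t-SNE once per learning rate and keep the final KL divergence
kl_by_rate = {}
for learning_rate in (50, 100, 200):
    model = TSNE(learning_rate=learning_rate, random_state=0)
    model.fit_transform(samples)
    kl_by_rate[learning_rate] = float(model.kl_divergence_)

for learning_rate, kl in kl_by_rate.items():
    print(learning_rate, kl)
```

Leaving random_state unset gives a different layout on each run, which is the behavior described in point 2 above.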
In [None]:
"""

Import TSNE from sklearn.manifold.
Create a TSNE instance called model with learning_rate=200.
Apply the .fit_transform() method of model to samples. Assign the result to tsne_features.
Select the column 0 of tsne_features. Assign the result to xs.
Select the column 1 of tsne_features. Assign the result to ys.
Make a scatter plot of the t-SNE features xs and ys. To color the points by the grain variety, specify the additional keyword argument c=variety_numbers.

"""

# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=200)

# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers)
plt.show()


In [None]:
"""

Import TSNE from sklearn.manifold.
Create a TSNE instance called model with learning_rate=50.
Apply the .fit_transform() method of model to normalized_movements. Assign the result to tsne_features.
Select column 0 and column 1 of tsne_features.
Make a scatter plot of the t-SNE features xs and ys. Specify the additional keyword argument alpha=0.5.
Code to label each point with its company name is included below, using plt.annotate().

"""


# Import TSNE
from sklearn.manifold import TSNE

# Create a TSNE instance: model
model = TSNE(learning_rate=50)

# Apply fit_transform to normalized_movements: tsne_features
tsne_features = model.fit_transform(normalized_movements)

# Select the 0th feature: xs
xs = tsne_features[:,0]

# Select the 1st feature: ys
ys = tsne_features[:,1]

# Scatter plot
plt.scatter(xs, ys, alpha=0.5)

# Annotate the points
for x, y, company in zip(xs, ys, companies):
    plt.annotate(company, (x, y), fontsize=5, alpha=0.75)
plt.show()
