# Unsupervised Learning

In [None]:
"""

Unsupervised learning is a class of machine learning techniques for discovering patterns in data.
For instance, it can find the natural "clusters" of customers based on their purchase histories, or search for patterns and correlations among these purchases
and use those patterns to express the data in a compressed form.

These are examples of unsupervised learning techniques called "clustering" and "dimension reduction".

"""

**Cluster labels for new samples**

In [None]:
"""

After running K-means, the algorithm produces "centroids": the average positions of each cluster in the data space.
For example, if someone is grouping iris flowers by their petal length and width, each cluster will have a centroid which represents the average petal length
and width of flowers in that group.



Now, what happens if we get new data (like new iris flowers)?
-------------------------------------------------------------

There is no need to rerun K-means from scratch on the new data. Instead, we can use the existing centroids to assign each new sample to the closest cluster.

Example:
Suppose that after clustering we have 3 clusters with centroids at:
Centroid 1: (5.1, 3.5)
Centroid 2: (6.2, 2.8)
Centroid 3: (5.8, 4.1)

Now, if we get a new sample, say a flower with petal length 5.6 and width 3.0, the algorithm compares this sample to the centroid of each cluster
and assigns it to the cluster whose centroid is closest.

In this case, the new sample would be assigned to Centroid 2, because its Euclidean distance to that centroid (about 0.63) is smaller than its distance to Centroid 1 (about 0.71) or Centroid 3 (about 1.12).

"""

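As a rough illustration of the assignment step described above, the following sketch computes the Euclidean distance from the new sample to each of the three example centroids with NumPy and picks the nearest one. The variable names (`centroids`, `new_sample`, `nearest`) are chosen here for illustration only; a fitted `KMeans` model does the equivalent nearest-centroid lookup in its `.predict()` method.

```python
import numpy as np

# Centroids from the example above, one row per cluster
centroids = np.array([
    [5.1, 3.5],  # Centroid 1
    [6.2, 2.8],  # Centroid 2
    [5.8, 4.1],  # Centroid 3
])

# The new flower to assign
new_sample = np.array([5.6, 3.0])

# Euclidean distance from the new sample to every centroid
distances = np.linalg.norm(centroids - new_sample, axis=1)

# Index (0-based) of the closest centroid
nearest = int(np.argmin(distances))

for i, d in enumerate(distances, start=1):
    print(f"Distance to Centroid {i}: {d:.3f}")
print(f"Assigned to Centroid {nearest + 1}")
```
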
In [None]:
### How many clusters?

"""

You are given an array points of size 300x2, where each row gives the (x, y) co-ordinates of a point on a map. Make a scatter plot of these points,
and use the scatter plot to guess how many clusters there are.

matplotlib.pyplot has already been imported as plt.

Create an array called xs that contains the values of points[:,0] - that is, column 0 of points.
Create an array called ys that contains the values of points[:,1] - that is, column 1 of points.
Make a scatter plot by passing xs and ys to the plt.scatter() function.
Call the plt.show() function to show your plot.
How many clusters do you see?

"""


import matplotlib.pyplot as plt
xs = points[:,0]
ys = points[:,1]
plt.scatter(xs, ys)
plt.show()

In [None]:
### Clustering 2D points

"""

From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters.
You'll now create a KMeans model to find 3 clusters, and fit it to the data points from the previous exercise.
After the model has been fit, you'll obtain the cluster labels for some new points using the .predict() method.


You are given the array points from the previous exercise, and also an array new_points.

"""


"""

Import KMeans from sklearn.cluster.
Using KMeans(), create a KMeans instance called model to find 3 clusters. To specify the number of clusters, use the n_clusters keyword argument.
Use the .fit() method of model to fit the model to the array of points points.
Use the .predict() method of model to predict the cluster labels of new_points, assigning the result to labels.
Print labels to see the cluster labels of new_points.

"""


# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters = 3)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)



# [0 1 2 0 1 0 1 1 1 2 0 1 1 2 2 1 2 2 1 1 2 1 0 1 0 2 1 2 2 0 0 1 1 1 2 0 1
#  1 0 1 2 0 0 2 0 1 2 2 1 1 1 1 2 2 0 0 2 2 2 0 0 1 1 1 0 1 2 1 0 2 0 0 0 1
#  0 2 2 0 1 2 0 2 0 1 2 1 2 0 1 1 1 0 1 1 0 2 2 2 2 0 1 0 2 2 0 0 1 0 2 2 0
#  2 2 2 1 1 1 1 2 2 1 0 1 2 1 0 2 1 2 2 1 2 1 2 0 1 0 0 1 2 0 1 0 0 2 1 1 0
#  2 0 2 1 0 2 2 0 2 1 1 2 1 2 2 1 1 0 1 1 2 0 2 0 0 1 0 1 1 0 0 2 0 0 0 2 1
#  1 0 2 0 2 2 1 1 1 0 1 1 1 2 2 0 1 0 0 0 2 1 1 1 1 1 1 2 2 1 2 2 2 2 1 2 2
#  1 1 0 2 0 0 2 0 2 0 2 1 1 2 1 1 1 2 0 0 2 1 1 2 1 2 2 1 2 2 0 2 0 0 0 1 2
#  2 2 0 1 0 2 0 2 2 1 0 0 0 2 1 1 1 0 1 2 2 1 0 0 2 0 0 2 0 1 0 2 2 2 2 1 2
#  2 1 1 0]

In [None]:
### Inspect your clustering

"""

Import matplotlib.pyplot as plt.
Assign column 0 of new_points to xs, and column 1 of new_points to ys.
Make a scatter plot of xs and ys, specifying the c=labels keyword argument to color the points by their cluster label. Also specify alpha=0.5.
Compute the coordinates of the centroids using the .cluster_centers_ attribute of model.
Assign column 0 of centroids to centroids_x, and column 1 of centroids to centroids_y.
Make a scatter plot of centroids_x and centroids_y, using 'D' (a diamond) as a marker by specifying the marker parameter. Set the size of the markers to be 50 using s=50.

"""

# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:,0]
ys = new_points[:,1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha = 0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()



# Evaluating a Clustering

In [None]:
"""

1) Comparing Clusters with Known Labels

If we already have correct labels (like the actual species of iris flowers: Setosa, Versicolor, and Virginica), we can compare our clusters with these labels.

Suppose our K-means clustering groups flowers into three clusters. We can check if each cluster mostly contains one species.
    If Cluster 1 has mostly Setosa, Cluster 2 has mostly Versicolor, and Cluster 3 has mostly Virginica, our clustering is good!
    If clusters mix up species randomly, it's a sign that the clustering is not working well.



2) Measuring Quality Without Labels

But what if we don't have actual species labels?
We need a way to measure clustering quality without using correct answers.
There are techniques like inertia (within-cluster distance) and silhouette score, which help us decide if clusters are well-separated and compact.


3) Choosing the Right Number of Clusters
A good quality measure helps us decide how many clusters we should use.
We can try different numbers of clusters (2, 3, 4…) and see which gives the best separation.

"""

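To make the second and third points more concrete, here is a small sketch that scores a clustering without any true labels, using scikit-learn's `silhouette_score`. The data is synthetic (three blobs generated with `make_blobs`), and the blob centers, `random_state`, and `n_init` values are assumptions made purely for illustration; they are not part of the exercises.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data: three well-separated blobs (illustrative, not the course data)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Try several numbers of clusters and score each one without using true labels
scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(X)
    # The silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, cluster_labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# The k with the highest silhouette score is a reasonable candidate
best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```

On real data the scores are rarely this clear-cut, so it usually helps to look at more than one quality measure before settling on a number of clusters.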
**Inertia measures clustering quality**

In [None]:
"""

1) What Makes a Good Clustering?
---------------------------------------
A good clustering means that data points in the same cluster are close together (not spread out too much).


2) What is Inertia?
-----------------------
Inertia measures how far data points are from the center (centroid) of their cluster.
More precisely, it is the sum of the squared distances from each sample to its closest centroid.

Lower inertia is better because it means points are closer to their centroid, making clusters tighter.


NOTE: K-means aims to place the centroids in a way that minimizes the inertia.

"""

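The sketch below recomputes inertia by hand, as the sum of squared distances from each sample to its assigned centroid, and compares it with the `inertia_` attribute of a fitted `KMeans` model. The synthetic data and the `n_init` and `random_state` settings are illustrative assumptions, not values from the exercises.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (illustrative only)
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each sample
assigned_centroids = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each sample to its centroid
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print("manual inertia:", manual_inertia)
print("model.inertia_:", model.inertia_)
```
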
**How many clusters to choose?**

In [None]:
"""

Ultimately, this is a trade-off. A good clustering has tight clusters (meaning low inertia). But it also doesn't have too many clusters.
A good rule of thumb is to choose an elbow in the inertia plot, that is, a point where the inertia begins to decrease more slowly.

"""

In [None]:
### How many clusters of grain?

"""

For each of the given values of k, perform the following steps:
Create a KMeans instance called model with k clusters.
Fit the model to the grain data samples.
Append the value of the inertia_ attribute of model to the list inertias.
The code to plot ks vs inertias is provided below; run it to see the plot.

"""

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters = k)

    # Fit model to samples
    model.fit(samples)

    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

In [None]:
### Evaluating the grain clustering

"""

Create a KMeans model called model with 3 clusters.
Use the .fit_predict() method of model to fit it to samples and derive the cluster labels. Using .fit_predict() is the same as using .fit() followed by .predict().
Create a DataFrame df with two columns named 'labels' and 'varieties', using labels and varieties, respectively, for the column values. This has been done for you.
Use the pd.crosstab() function on df['labels'] and df['varieties'] to count the number of times each grain variety coincides with each cluster label. Assign the result to ct.
Print ct to see the cross-tabulation.

"""


# Import pandas, which is needed for the DataFrame and cross-tabulation below
import pandas as pd

# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters = 3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(samples)

# Create a DataFrame with labels and varieties as columns: df
df = pd.DataFrame({'labels': labels, 'varieties': varieties})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['varieties'])

# Display ct
print(ct)
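
The exercise above notes that `.fit_predict()` gives the same result as `.fit()` followed by `.predict()` on the same data. A minimal check of that claim, on synthetic blobs rather than the grain samples, might look like the following; the data and the fixed `random_state` (used so both models start from the same initialization) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated blobs (illustrative only)
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=0,
)

# Option 1: fit and obtain labels in a single call
model_a = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_a = model_a.fit_predict(X)

# Option 2: fit first, then predict on the same data
model_b = KMeans(n_clusters=3, n_init=10, random_state=0)
model_b.fit(X)
labels_b = model_b.predict(X)

print("Same labels:", np.array_equal(labels_a, labels_b))
```
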


# Transforming features for better clusterings

**Feature Variances**

In [None]:
"""

1) Problem: K-Means Clusters Don’t Match Wine Varieties
-----------------------------------------------------------
We want to group three types of wine using K-Means, but when we check the results, the clusters don’t match the actual wine varieties.

 Ideal Case: Each wine variety gets its own cluster.
 What Happened: The clustering is incorrect, and different wine varieties are mixed in the same cluster.



2) Why? Some Features Have Higher Variance Than Others
------------------------------------------------------------
Variance means how spread out a feature's values are.
If one feature's values are spread over a much wider range than another's, K-Means will give more importance to the high-variance feature, because it dominates the distance calculation. This can lead to bad clusters.
Example:
In the wine dataset, Malic Acid has a high variance (its values are spread out).
Another feature, OD280, has a low variance (its values are closer together).
Since K-Means uses distances, it will pay more attention to Malic Acid and largely ignore OD280, causing incorrect clusters.



3) Solution: Standardize the Features Using StandardScaler
------------------------------------------------------------
To fix this, we can use StandardScaler which makes all features have:
      Mean = 0
      Variance = 1

"""

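A quick way to see what `StandardScaler` does is to apply it to a tiny array whose columns sit on very different scales. The values below are made up for illustration and are not taken from the wine dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (illustrative values only)
demo_samples = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 2000.0],
    [4.0, 6000.0],
])

print("Variances before scaling:", demo_samples.var(axis=0))

# Standardize each column: subtract its mean, divide by its standard deviation
scaler = StandardScaler()
scaled = scaler.fit_transform(demo_samples)

# After scaling, each column has mean 0 and variance 1 (up to rounding)
print("Means after scaling:", scaled.mean(axis=0).round(6))
print("Variances after scaling:", scaled.var(axis=0).round(6))
```

With both columns on the same footing, neither one dominates the distances that K-Means relies on.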
In [None]:
"""

Import:
make_pipeline from sklearn.pipeline.
StandardScaler from sklearn.preprocessing.
KMeans from sklearn.cluster.


Create an instance of StandardScaler called scaler.
Create an instance of KMeans with 4 clusters called kmeans.
Create a pipeline called pipeline that chains scaler and kmeans. To do this, you just need to pass them in as arguments to make_pipeline().

"""


# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Create scaler: scaler
scaler = StandardScaler()

# Create KMeans instance: kmeans
kmeans = KMeans(n_clusters = 4)

# Create pipeline: pipeline
pipeline = make_pipeline(scaler , kmeans)


In [None]:
"""

Import pandas as pd.
Fit the pipeline to the fish measurements samples.
Obtain the cluster labels for samples by using the .predict() method of pipeline.
Using pd.DataFrame(), create a DataFrame df with two columns named 'labels' and 'species', using labels and species, respectively, for the column values.
Using pd.crosstab(), create a cross-tabulation ct of df['labels'] and df['species']

"""

# Import pandas
import pandas as pd

# Fit the pipeline to samples
pipeline.fit(samples)

# Calculate the cluster labels: labels
labels = pipeline.predict(samples)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels':labels , 'species':species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'] , df['species'])

# Display ct
print(ct)


In [None]:
### Clustering stocks using KMeans

"""

In this exercise, we'll cluster companies using their daily stock price movements (i.e. the dollar difference between the closing and opening prices for each trading day).
We are given a NumPy array movements of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company,
and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, include a Normalizer at the beginning of the pipeline.
The Normalizer will separately transform each company's price movements to a relative scale before the clustering begins.

Normalizer() differs from StandardScaler(): StandardScaler() standardizes each feature (column) across all samples, whereas Normalizer() rescales each sample (row) - here, each company's price movements - independently of the others.

"""



"""

Import Normalizer from sklearn.preprocessing.
Create an instance of Normalizer called normalizer.
Create an instance of KMeans called kmeans with 10 clusters.
Using make_pipeline(), create a pipeline called pipeline that chains normalizer and kmeans.
Fit the pipeline to the movements array.

"""


# Import Normalizer
from sklearn.preprocessing import Normalizer

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 10 clusters: kmeans
kmeans = KMeans(n_clusters = 10)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)



# Pipeline(steps=[('normalizer', Normalizer()),
#                 ('kmeans', KMeans(n_clusters=10))])
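
To illustrate the per-sample rescaling that `Normalizer` performs, the sketch below applies it to two made-up rows of price movements, where the second "company" moves in the same pattern as the first but at 100 times the dollar scale. The array `movements_demo` is hypothetical and unrelated to the real `movements` data; the default L2 norm of `Normalizer` is used.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up daily movements for two companies (one per row);
# the second follows the same pattern at 100x the dollar scale
movements_demo = np.array([
    [0.5, -1.0, 0.25],
    [50.0, -100.0, 25.0],
])

# Normalizer rescales each row independently to unit Euclidean (L2) norm
normalized = Normalizer().fit_transform(movements_demo)

# Every row now has length 1
row_norms = np.linalg.norm(normalized, axis=1)
print("Row norms:", row_norms)

# Rows with the same relative pattern become identical after normalization
print("Rows identical:", np.allclose(normalized[0], normalized[1]))
```

This is why the clustering ends up grouping companies by how their prices move relative to each other, rather than by how expensive the shares are.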

In [None]:
### Which stocks move together?

"""

In the previous exercise, we clustered companies by their daily stock price movements. So which companies have stock prices that tend to change in the same way?
We'll now inspect the cluster labels from your clustering to find out.

"""



"""

Import pandas as pd.
Use the .predict() method of the pipeline to predict the labels for movements.
Align the cluster labels with the list of company names companies by creating a DataFrame df with labels and companies as columns. This has been done for you.
Use the .sort_values() method of df to sort the DataFrame by the 'labels' column, and print the result.
Take a moment to see which companies are together in each cluster!

"""


# Import pandas
import pandas as pd

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

# Create a DataFrame aligning labels and companies: df
df = pd.DataFrame({'labels': labels, 'companies': companies})

# Display df sorted by cluster label
print(df.sort_values('labels'))




# labels                           companies
# 59       0                               Yahoo
# 15       0                                Ford
# 35       0                            Navistar
# 26       1                      JPMorgan Chase
# 16       1                   General Electrics
# 58       1                               Xerox
# 11       1                               Cisco
# 18       1                       Goldman Sachs
# 20       1                          Home Depot
# 5        1                     Bank of America
# 3        1                    American express
# 55       1                         Wells Fargo
# 1        1                                 AIG
# 38       2                               Pepsi
# 40       2                      Procter Gamble
# 28       2                           Coca Cola
# 27       2                      Kimberly-Clark
# 9        2                   Colgate-Palmolive
# 54       3                            Walgreen
# 36       3                    Northrop Grumman
# 29       3                     Lookheed Martin
# 4        3                              Boeing
# 0        4                               Apple
# 47       4                            Symantec
# 33       4                           Microsoft
# 32       4                                  3M
# 31       4                           McDonalds
# 30       4                          MasterCard
# 50       4  Taiwan Semiconductor Manufacturing
# 14       4                                Dell
# 17       4                     Google/Alphabet
# 24       4                               Intel
# 23       4                                 IBM
# 2        4                              Amazon
# 51       4                   Texas instruments
# 43       4                                 SAP
# 45       5                                Sony
# 48       5                              Toyota
# 21       5                               Honda
# 22       5                                  HP
# 34       5                          Mitsubishi
# 7        5                               Canon
# 56       6                            Wal-Mart
# 57       7                               Exxon
# 44       7                        Schlumberger
# 8        7                         Caterpillar
# 10       7                      ConocoPhillips
# 12       7                             Chevron
# 13       7                   DuPont de Nemours
# 53       7                       Valero Energy
# 39       8                              Pfizer
# 41       8                       Philip Morris
# 25       8                   Johnson & Johnson
# 49       9                               Total
# 46       9                      Sanofi-Aventis
# 37       9                            Novartis
# 42       9                   Royal Dutch Shell
# 19       9                     GlaxoSmithKline
# 52       9                            Unilever
# 6        9            British American Tobacco