# Unsupervised Learning

## What is Unsupervised Learning?
Unsupervised learning finds hidden patterns in data **without using labels**. Unlike supervised learning, there are no known outputs to guide training; the algorithm looks for structure in the inputs alone.

Two main techniques:
- **Clustering**: Grouping similar data points together
- **Dimensionality reduction**: Simplifying data by reducing the number of features while keeping most of the essential information

## Supervised vs. Unsupervised
- **Supervised**: You have inputs and outputs. The goal is to learn the mapping (like predicting cancer type from tumor size).
- **Unsupervised**: You only have inputs. The goal is to discover hidden patterns (like clustering customer behavior).
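
The difference shows up directly in how a model is fitted. A minimal sketch, assuming synthetic data from `make_blobs` and `LogisticRegression` as a stand-in supervised model (neither comes from these notes): the supervised model receives inputs and outputs, the unsupervised one receives inputs only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: X holds the inputs, y the known outputs
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

# Supervised: the model is fitted on inputs AND known outputs
classifier = LogisticRegression(max_iter=1000)
classifier.fit(X, y)
predicted_outputs = classifier.predict(X[:5])

# Unsupervised: the model is fitted on inputs only; y is never used
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
clusterer.fit(X)
cluster_ids = clusterer.predict(X[:5])

print(predicted_outputs)  # values taken from the label set in y
print(cluster_ids)        # arbitrary cluster numbers chosen by the algorithm
```

Cluster numbers are arbitrary: cluster 0 need not correspond to label 0, even when the grouping itself matches.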

## The Iris Dataset
A classic dataset used in ML:
- 3 species of iris flowers
- Each sample has 4 measurements: petal length, petal width, sepal length, sepal width
- Each sample is therefore a point in 4-dimensional space, so the data is **4-dimensional**
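
One way to inspect the dataset is through scikit-learn's bundled copy. This is a sketch assuming `sklearn.datasets.load_iris` is the source of the data; these notes do not say where the arrays were loaded from.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Measurements only; the species labels live separately in iris.target
samples = iris.data

print(samples.shape)            # (number of samples, number of features)
print(iris.feature_names)       # names of the 4 measurements
print(list(iris.target_names))  # names of the 3 species
```

The column order in scikit-learn's copy is whatever `iris.feature_names` reports, so it is worth checking before picking columns by index.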

## Data Format
- Data is stored as a 2D NumPy array:
  - Rows = samples (individual flowers)
  - Columns = features (measurements)
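
A small illustration of this layout, using three made-up samples (the values are for demonstration only, not real measurements):

```python
import numpy as np

# Three made-up samples, each with four features
samples = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.2, 2.9, 4.3, 1.3],
    [7.7, 3.0, 6.1, 2.3],
])

n_samples, n_features = samples.shape
first_sample = samples[0]      # one row = all features of one sample
third_feature = samples[:, 2]  # one column = one feature across all samples

print(n_samples, n_features)
print(first_sample)
print(third_feature)
```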

## k-Means Clustering
A popular clustering method:
- You choose the number of clusters (`k`)
- The algorithm assigns each point to its nearest centroid, recomputes the centroids, and repeats until the assignments stop changing
- Each cluster has a **centroid** (the mean of the points in that cluster)
- New data can be assigned to clusters by checking which centroid it’s closest to
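
To make the idea concrete, here is a simplified NumPy sketch of the assign-then-update loop. It is not scikit-learn's implementation, which adds smarter initialization and several restarts. The function names `assign_to_nearest` and `kmeans_sketch` and the synthetic data are made up for illustration.

```python
import numpy as np


def assign_to_nearest(samples, centroids):
    """Return, for each sample, the index of the closest centroid."""
    # distances[i, j] = Euclidean distance from sample i to centroid j
    distances = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(samples, k, n_iters=100, seed=0):
    """Illustrative k-means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random
    centroids = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(n_iters):
        labels = assign_to_nearest(samples, centroids)
        # Move each centroid to the mean of its points (unchanged if the cluster is empty)
        new_centroids = np.array([
            samples[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assign_to_nearest(samples, centroids)


# Two well-separated synthetic groups of 2D points
rng = np.random.default_rng(1)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(samples, k=2)

# New data is assigned to whichever centroid is closest
new_samples = np.array([[0.2, -0.1], [4.8, 5.3]])
new_labels = assign_to_nearest(new_samples, centroids)
print(centroids)
print(new_labels)
```

Assigning new points reuses the same nearest-centroid step, which is why a fitted model can label data it has never seen.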

### How to use k-means with `scikit-learn`:
1. Import:  
   `from sklearn.cluster import KMeans`
2. Create a model:  
   `model = KMeans(n_clusters=3)`
3. Fit the model to the data:  
   `model.fit(samples)`
4. Predict labels (cluster numbers) for each sample:  
   `labels = model.predict(samples)`
5. To predict for new samples:  
   `new_labels = model.predict(new_samples)`
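
Put together, the steps above might look like the following sketch. It assumes `samples` comes from scikit-learn's bundled Iris data and uses made-up `new_samples`; `n_init` and `random_state` are extra arguments added here so repeated runs give the same result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Measurements only; the species labels are not used
samples = load_iris().data

# n_init and random_state are set explicitly for reproducibility
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(samples)

# One cluster number per sample
labels = model.predict(samples)

# Hypothetical new flowers: made-up measurements in the same column order
new_samples = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.5, 3.0, 5.5, 2.0],
])
new_labels = model.predict(new_samples)

print(labels[:10])
print(new_labels)
print(model.cluster_centers_.shape)  # (n_clusters, n_features)
```

When labels are only needed for the training data, `model.fit_predict(samples)` combines steps 3 and 4 in one call.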

## Visualizing Clusters
Scatter plots help visualize the clustering:
- Use matplotlib (`import matplotlib.pyplot as plt`)
- Choose two features to plot (e.g., sepal length vs. petal length)
- Color the points based on cluster label (`c=labels`)
- Display with `plt.show()`
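
A possible version of this plot, again assuming scikit-learn's bundled Iris data, where column 0 is sepal length and column 2 is petal length:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
samples = iris.data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(samples)

# Pick two of the four features to plot
xs = samples[:, 0]  # sepal length
ys = samples[:, 2]  # petal length

# Color each point by its cluster label
fig, ax = plt.subplots()
ax.scatter(xs, ys, c=labels, alpha=0.6)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[2])
ax.set_title("k-means clusters on two Iris features")
```

Calling `plt.show()` afterwards displays the figure. The clusters were found using all four features, so a two-feature plot shows only one projection of them.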

## Final Thoughts
Unsupervised learning helps make sense of unlabeled data: it's about **exploration** rather than **prediction**. You don’t need to know the answers beforehand; you let the data tell its own story.


In [1]:
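# Import KMeans and a helper for generating example data
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: the original `points` and `new_points` arrays are not included
# in these notes, so synthetic 2D stand-ins are generated here
data, _ = make_blobs(n_samples=400, centers=3, random_state=42)
points, new_points = data[:300], data[300:]

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3, n_init=10, random_state=42)

# Fit model to points
model.fit(points)

# Determine the cluster labels of new_points: labels
labels = model.predict(new_points)

# Print cluster labels of new_points
print(labels)
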

In [None]:
# Import pyplot
import matplotlib.pyplot as plt

# Assign the columns of new_points: xs and ys
xs = new_points[:, 0]
ys = new_points[:, 1]

# Make a scatter plot of xs and ys, using labels to define the colors
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:, 0]
centroids_y = centroids[:, 1]

# Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x, centroids_y, marker='D', s=50, c='red')  # Diamonds in red color

plt.title("Clustered Points with Centroids")
plt.xlabel("X-coordinate")
plt.ylabel("Y-coordinate")
plt.show()