# Clustering

![image.png](attachment:79ea3b60-36da-4ed4-8d92-e50ee38bbaa9.png)

### Learning Goals
1. Be able to discuss and discover use cases for clustering across multiple industries.
2. Be able to recognize common clustering algorithms.
3. General understanding of how the k-means clustering algorithm works.
4. Ability to implement k-means clustering in Python.
5. Handle outliers using IQR.
6. More practice scaling data.
7. Strategies for missing values.
8. Plotting clusters.
9. Ability to make use of clusters discovered later down the data science pipeline.



### About Clustering
![image.png](attachment:c96c94c8-453d-4a3b-863e-8634f2c00a5a.png)

### Clustering is...

- **Unsupervised** machine learning methodology
- Used to group and identify similar observations when we do not have labels that identify the groups.
- Often a preprocessing or an exploratory step in the data science pipeline.

Suppose you have a dataset with observations and features, but no labels or target variable.

**You want to predict which group of similar observations a new observation will fall in.** 
- Without the labels, you cannot use a supervised algorithm, such as a random forest or KNN. 

We can address this challenge by finding groups of data in our dataset which are similar to one another. These groups are called clusters.

Clustering can also be used for **data exploration**, or to **generate a new feature** that can then be fed into a supervised model.

Formally, clustering is an **unsupervised process** of grouping similar observations or objects together. In this process similarities are based on comparing a vector of information for each observation or object, often using various mathematical distance functions.
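As a rough illustration of how a distance function decides which group an observation falls in, the sketch below assigns a new observation to the nearest of two cluster centers. The centroid values, the labels `cluster_a` and `cluster_b`, and the helper `nearest_cluster` are all hypothetical, made up for this example; in practice the centers would come from a clustering algorithm fitted to real data.

```python
import math

# Hypothetical cluster centers (centroids), assumed to have been found already.
centroids = {
    "cluster_a": (1.0, 1.0),
    "cluster_b": (8.0, 9.0),
}


def nearest_cluster(observation, centroids):
    """Return the label of the centroid closest to the observation."""
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return min(centroids, key=lambda label: math.dist(observation, centroids[label]))


new_observation = (2.0, 1.5)
print(nearest_cluster(new_observation, centroids))  # cluster_a
```

The resulting label could then be stored as a new feature for a downstream supervised model.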



### Clustering methodologies include:
- Partition-based clustering (K-Means)
- Hierarchical clustering
- Density-based clustering, e.g. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Every methodology follows a different set of rules for defining the ‘similarity’ among data points. 

While there are more than 100 known clustering algorithms, only a few are widely used. We will focus on K-means in the coming lesson.

### Clustering Use Cases
- Text: Document classification, summarization, topic modeling, recommendations
- Geographic: Crime zones, housing prices
- Marketing: Customer segmentation, market research
- Anomaly Detection: Account takeover, security risk, fraud
- Image Processing: Radiology, security



![image.png](attachment:344cd59d-a510-404b-8643-296ffa3872a0.png)

## Vocabulary
### Euclidean Distance

The straight-line (shortest) distance between two points in n-dimensional space, a.k.a. L2 distance.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$
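A minimal sketch of the Euclidean distance in plain Python, assuming each point is a sequence of numbers of equal length. The function name `euclidean_distance` is chosen for this example only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line (L2) distance between two equal-length points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square each coordinate difference, sum the squares, take the square root.
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

For Python 3.8 and later, the standard library's `math.dist(p, q)` computes the same quantity.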



### Manhattan Distance

The distance between two points is the sum of the absolute differences of their Cartesian coordinates. 

Also known as: 
- taxicab metric
- rectilinear distance
- ![Screen Shot 2021-03-25 at 11.19.26 AM.png](attachment:2ac0e1f0-d8e5-4f6e-8ed6-ca3cb7ff7eab.png) distance
- ![Screen Shot 2021-03-25 at 11.19.43 AM.png](attachment:25a01c06-98c7-4747-afd0-d9bbdbef0f3e.png) distance or ![Screen Shot 2021-03-25 at 11.20.06 AM.png](attachment:e6fe6297-cc05-4711-9f45-eca98b5a54b1.png) norm
- snake distance
- city block distance
- Manhattan length.
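The definition above can be sketched in a few lines of Python, again assuming each point is an equal-length sequence of numbers. The function name `manhattan_distance` is illustrative.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


# Moving from (0, 0) to (3, 4) along a grid takes 3 + 4 = 7 steps,
# whereas the straight-line distance between the same points is 5.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```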

### Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors and uses it as a measure of how similar they are.

It is a measure of orientation and not magnitude: two vectors with the same orientation, i.e. parallel, have a cosine similarity of 1 indicating they are maximally "similar". 

Two vectors oriented at 90° relative to each other, i.e. perpendicular or orthogonal, have a similarity of 0.

Two vectors that are diametrically opposed (180°) have a similarity of -1. Cosine similarity is particularly used in positive space, where no vector component is negative and the outcome is neatly bounded in [0, 1]; there, orthogonal vectors are maximally "dissimilar".
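To make the three reference cases concrete, here is a small sketch that computes cosine similarity as the dot product divided by the product of the vector lengths. The function name `cosine_similarity` and the sample vectors are assumptions for illustration; the function raises an error for a zero vector, where the angle is undefined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity((1, 2), (2, 4)))   # parallel: approximately 1
print(cosine_similarity((1, 0), (0, 1)))   # orthogonal: 0.0
print(cosine_similarity((1, 0), (-1, 0)))  # opposed: -1.0
```

Note that `(1, 2)` and `(2, 4)` differ in magnitude but point the same way, so their similarity is still about 1.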

### Sparse vs. Dense Matrix

A sparse matrix is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense.

The number of zero-valued elements divided by the total number of elements (e.g., m × n for an m × n matrix) is called the **sparsity** of the matrix (which is equal to 1 minus the density of the matrix). Using those definitions, a matrix will be sparse when its sparsity is greater than 0.5.
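A short sketch of the sparsity calculation, assuming the matrix is stored as a list of rows. The function name `sparsity` and the sample matrix are made up for this example.

```python
def sparsity(matrix):
    """Fraction of elements in the matrix that are zero."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no elements")
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Hypothetical 3 x 3 matrix with 7 zero-valued elements out of 9.
matrix = [
    [0, 0, 3],
    [0, 5, 0],
    [0, 0, 0],
]

s = sparsity(matrix)
print(s > 0.5)        # True: the matrix is sparse under the definition above
print(1 - s)          # the density, i.e. the fraction of nonzero elements
```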

### Manhattan (Taxicab) vs Euclidean Distance

![image.png](attachment:98362f35-b72e-4773-8996-006b4cc6918a.png)