# Week 9 - Unsupervised Data Analysis

## 1. Supervised vs. unsupervised machine learning
- Supervised learning uses labelled datasets to train a model  
- Commonly used for classification and regression  

## Unsupervised learning
- Unsupervised learning employs machine learning algorithms to
  examine and group unlabeled datasets.  
- These algorithms unveil concealed patterns within data
  autonomously, eliminating the necessity for human guidance,
  hence termed "unsupervised."  
- Common tasks:  
    - Clustering, i.e., assign similar data points into groups  
    - Association mining, i.e., use different rules to find relationships
    between variables in a given dataset, often used for market
    basket analysis and recommendation engines, along the lines of
    “Customers Who Bought This Item Also Bought” recommendations.  
    - Dimensionality reduction, i.e., reduce the number of input
    features while preserving as much of the information in the data as possible

### Differences
- The main difference: supervised learning needs labeled data, unsupervised learning does not  
- Other differences:  
    - Goals: In supervised learning, the goal is to predict outcomes for
    new data. With an unsupervised learning algorithm, the goal is
    to get insights from large volumes of new data.  
    - Applications: Supervised learning models are ideal for spam
    detection, sentiment analysis, weather forecasting and pricing
    predictions, among other things.   
    In contrast, unsupervised learning is a great fit for anomaly  
    detection and customer personas.  
    - Drawbacks: Training supervised learning models can be time-
    consuming, as it demands expertise to label input and output
    variables accurately. Conversely, unsupervised learning
    approaches may yield highly inaccurate results without human
    intervention to validate the output variables.  

Semi-supervised learning
- Semi-supervised learning combines aspects of both supervised
  learning and unsupervised learning.  
- Machine learning techniques that fall under this category
  utilize both labeled and unlabeled data to train a predictive
  model.  
- Typically, it works in the following way:  
(1) Semi-supervised learning uses a small amount of labeled data to train
an initial model, which can be used to predict labels on a larger amount
of unlabeled data.  
(2) The model is then retrained iteratively on both the originally labeled data and
the data with predicted labels (pseudo-labels).  
(3) Afterwards, you add your most confident predictions to the labeled
dataset and repeat the process to continue improving the
performance of your model.  
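
The loop above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the "model" is a toy nearest-centroid classifier, the confidence score is a softmax over negative distances, and the `threshold` and `max_rounds` parameters are hypothetical choices, not part of the notes.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy model (assumption): one mean vector per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_with_confidence(model, X):
    classes, centroids = model
    # Distance from every point to every class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Softmax over negative distances as a rough confidence score.
    scores = np.exp(-d)
    proba = scores / scores.sum(axis=1, keepdims=True)
    return classes[proba.argmax(axis=1)], proba.max(axis=1)

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model = fit_centroids(X_lab, y_lab)            # (1) train on labeled data
        pseudo, conf = predict_with_confidence(model, X_unlab)
        keep = conf >= threshold                        # (3) keep confident predictions
        if not keep.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[keep]])       # (2) add pseudo-labels
        y_lab = np.concatenate([y_lab, pseudo[keep]])
        X_unlab = X_unlab[~keep]
    return fit_centroids(X_lab, y_lab), len(y_lab)

# Synthetic data: two blobs, only three labeled points per class.
rng = np.random.default_rng(0)
blob0 = rng.normal(loc=0.0, scale=0.5, size=(53, 2))
blob1 = rng.normal(loc=5.0, scale=0.5, size=(53, 2))
X_lab = np.vstack([blob0[:3], blob1[:3]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = np.vstack([blob0[3:], blob1[3:]])

model, n_labelled = self_train(X_lab, y_lab, X_unlab)
pred, conf = predict_with_confidence(model, np.array([[0.2, 0.1], [4.8, 5.1]]))
print(pred, n_labelled)
```

In practice the base model would be a real classifier, and the confidence threshold matters: a threshold that is too low lets wrong pseudo-labels reinforce themselves.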

Self-supervised learning  
- Self-supervised learning is a machine learning process where
  the model trains itself to learn one part of the input from
  another part of the input.  
- In this process, the unsupervised problem is transformed into
  a supervised problem by auto-generating the labels.  
- The process of the self-supervised learning method is to
  identify any hidden part of the input from any unhidden part
  of the input, e.g.:  
    - In natural language processing, if we have a few words, using self-
      supervised learning we can complete the rest of the sentence.  
    - In a video, we can predict past or future frames based on available
      video data.  
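
As a small illustration of auto-generated labels, the sketch below turns an unlabeled sentence into (context, target) training pairs for next-word prediction. The function name `next_word_pairs` and the `context_size` parameter are hypothetical; real language models use far richer setups.

```python
def next_word_pairs(sentence, context_size=2):
    """Build supervised pairs from raw text: the previous words are the
    input, the following word is the auto-generated label."""
    words = sentence.split()
    return [
        (tuple(words[i - context_size:i]), words[i])
        for i in range(context_size, len(words))
    ]

pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

No human labeling is involved: the hidden part of the input (the next word) serves as the label for the visible part (the context).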

## 2. Segmenting data  
- Sometimes data is segmented because of the
  context of the data  
  - Separate sources  
  - Separate collection circumstances  
  - Social or physical distinctions  
- Sometimes we don’t have pre-determined segments, but
  we want segmentation  
  - Some of the data may be similar  
  - Some of the modelling would be better if it didn’t need to
    represent all of the data  
  - Better decision-making if we consider each segment
    independently  

### Identifying Customer Segments  
- Customers are grouped into segments  
- Marketing is then specialised to each segment  
  - leads to better marketing  
- in healthcare, segments are called cohorts  
  - used for patient management and staff organisation  

Segmentation can be based on various attributes, e.g. geographic, demographic, psychographic, or behavioural.  
A segmentation model is a graphical model where  
- the cluster variable identifies the segments  
- the cluster variable is unknown, called “latent”, meaning it is never observed in the data.  



## 3. Clustering data
Depending on the context, a cluster can be  
- segmented data for analysis  
- segmented network nodes  
- segmented data storage  
- a group of associated computers  

So clustering = segmentation… sometimes  

Clustering tends to be associated with segmentation that
allows us to recognize similar combinations of attribute
values when we don’t have predefined categories.  

### Uses of clustering
- Text documents, e.g., patents, legal cases,
  webpages, questions and feedback  
    - Topic modelling  
- Clients, e.g., recommendation systems  
- Fault detection, e.g., fraud, network security  
- Missing data  
- A clustering task may require a number of different
  algorithms/approaches.  

Elements in a cluster
- Are similar in some attributes  
- May consider some attributes to weigh more than others  
    - Not all attributes are as important as others  
    - Needs feature selection  
- May be considered to be close to each other  
    - Needs distance measurements  

Clustering terms  
- Distance  
- Centroid  
- Nearest neighbour  
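
A minimal NumPy sketch of these three terms, assuming Euclidean distance (other distance measures are possible) and a made-up set of points:

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 5.0]])

# Centroid: the mean position of a group of points.
centroid = points.mean(axis=0)

# Distance: here Euclidean distance from a query point to every point.
query = np.array([1.0, 2.0])
distances = np.linalg.norm(points - query, axis=1)

# Nearest neighbour: the point with the smallest distance to the query.
nearest = int(distances.argmin())

print(centroid, distances, nearest)
```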

### K-means  
1. Randomly select centroids for K clusters
2. Assign each data point to its nearest centroid to form the cluster populations
3. Compute the mean of the points in each cluster and use it as the new centroid
4. Repeat steps 2 and 3 until the populations and centroids are stable (convergence)

Stopping criteria for K-means clustering  
(1) Centroids of newly formed clusters do not change  
(2) Points remain in the same cluster  
(3) Maximum number of iterations is reached  
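
The steps and stopping criteria above can be sketched in NumPy as follows. This is an illustrative implementation under some assumptions: initial centroids are drawn from the data points, distance is Euclidean, and the `max_iter` and `seed` parameters are hypothetical defaults.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):                      # stopping criterion (3)
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # stopping criterion (2)
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                                  # stopping criterion (1)
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(20, 2)),
])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can give different results on less clearly separated data, which is why library implementations usually run several initialisations.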

Downsides
- Does not work directly with categorical data, and is susceptible to outliers  
- A value for K has to be chosen in advance  
- No guarantee there are actually clusters to find  


### Evaluating clustering algorithms  
Clustering is often easiest to evaluate by visualization.  
High-dimensional data can be projected to lower dimensions via dimensionality reduction, e.g. PCA/t-SNE  
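
As a rough sketch of such a projection, PCA can be written with a singular value decomposition in NumPy. The function name `pca_project` and the random data are illustrative assumptions; t-SNE is a different, non-linear technique not shown here.

```python
import numpy as np

def pca_project(X, n_components=2):
    # Centre the data so the principal directions pass through the mean.
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    return X_centred @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples with 5 features
X_2d = pca_project(X)           # 100 samples with 2 features, ready to plot
print(X_2d.shape)
```

The two resulting columns can be used as x and y coordinates in a scatter plot, coloured by cluster label.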

#### Internal metrics  
- Internal metrics use only the data and the clustering  
  output to measure how well the clusters are formed.  
- Some well-known internal metrics include:  
    - Silhouette Score  
    - Dunn’s Index  

#### Silhouette Score  
- Measures how close each sample is to its own cluster compared with the nearest neighbouring cluster  
- Value range [-1, 1]  
    - Close to +1: Sample is far away from neighbouring clusters  
    - Close to 0: Sample is very close to the decision boundary between neighbouring clusters  
    - Negative values: Sample may have been assigned to the wrong cluster  
![image.png](attachment:image.png)
<style type="text/css">
    img {
        width: 400px;
    }
</style>
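
The score can be computed by hand using the standard definition s = (b - a) / max(a, b), where a is a sample's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below assumes Euclidean distance and scores singleton clusters as 0; the function name `silhouette` is illustrative.

```python
import numpy as np

def silhouette(X, labels):
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the sample itself
        if not same.any():
            scores.append(0.0)                # singleton cluster (assumed convention)
            continue
        a = dists[i, same].mean()             # cohesion within own cluster
        b = min(dists[i, labels == c].mean()  # separation from nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = silhouette(X, [0, 0, 1, 1])   # clusters match the two obvious groups
bad = silhouette(X, [0, 1, 0, 1])    # clusters cut across the groups
print(good, bad)
```

The sensible grouping scores close to +1, while the grouping that mixes the two groups gives a negative score.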

#### Elbow method  
- The Elbow Method is a graphical way of finding the optimal ‘K’ in
  K-means clustering.  
- It works by computing the WCSS (Within-Cluster Sum of Squares), i.e. the sum of the squared
  distances between the points in a cluster and the cluster centroid.  
- It involves plotting the WCSS against the number of clusters and
  identifying the “elbow” point, where the sharp decrease in WCSS
  levels off, suggesting an appropriate cluster count for analysis or model
  training.  
![image-2.png](attachment:image-2.png)  
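
A minimal sketch of the computation behind such a plot, assuming a compact K-means with a fixed number of iterations and a single random initialisation (the function name `kmeans_wcss` and its parameters are illustrative). The data is synthetic, with three blobs.

```python
import numpy as np

def kmeans_wcss(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign points to the nearest centroid, then recompute the means.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    # WCSS: squared distance of every point to its own centroid, summed.
    return float(((X - centroids[labels]) ** 2).sum())

rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

wcss = {k: kmeans_wcss(X, k) for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting `wcss` against K gives the elbow curve. With a single random initialisation the curve can be bumpy, so the elbow is a visual heuristic and not an exact answer.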

#### External Metrics  
- External metrics use some external information, such as
  labels, classes, or ground truth, to measure how well the
  clusters match the expected or desired outcomes.  
- Some well-known external metrics include:  
  - Accuracy, i.e., how many data points are correctly
    grouped?  
  - Rand index  
    - Similarity measure between two clusterings that considers all pairs of samples
    and counts the pairs that are assigned to the same or to different clusters in
    both the predicted and the true clustering.  
    - RI = Number of agreeing pairs/number of pairs  
    - Value range of [0,1], with 1 representing perfect match  
  - Mutual information  
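
The Rand index formula above translates directly into code. A small pure-Python sketch (the function name `rand_index` is illustrative):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    pairs = list(combinations(range(len(labels_true)), 2))
    # A pair "agrees" if both clusterings put it together, or both keep it apart.
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Cluster names do not matter, only which samples are grouped together.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# One cluster split in two: 5 of the 6 pairs still agree.
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```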
  


### Hierarchical clustering  
- Clusters within clusters
- Greedy, costly
- No randomness (reproducible), can cut the tree at any level

#### Agglomerative (bottom-up)
- Common steps:
    - Treat each data point as a centroid in a cluster of
    population 1
    - Form new clusters by merging the nearest clusters
    - Continue until only one cluster remains
- Various ways to calculate which clusters should be
  merged, often looking at the minimum (single linkage) or maximum
  (complete linkage) distance between the members of two clusters
- The results of hierarchical clustering are usually presented
  in a dendrogram
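
A naive sketch of the agglomerative procedure, assuming Euclidean distance. Passing Python's built-in `min` as the `linkage` argument gives single linkage and `max` gives complete linkage; the function name and parameters are illustrative, and the triple loop is far slower than library implementations.

```python
import numpy as np

def agglomerative(X, n_clusters=1, linkage=min):
    clusters = [[i] for i in range(len(X))]        # every point starts alone
    dists = np.linalg.norm(X[:, None] - X[None], axis=2)
    merges = []                                    # merge history (what a dendrogram draws)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance: min or max over all cross-cluster point pairs.
                d = linkage(dists[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        clusters[a] = clusters[a] + clusters[b]    # merge the closest pair
        del clusters[b]
    return clusters, merges

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
three, _ = agglomerative(X, n_clusters=3)
two, merges = agglomerative(X, n_clusters=2)
print(three)
print(two)
```

Stopping at `n_clusters` corresponds to cutting the tree at a chosen level; running to a single cluster records the full merge history.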

#### Divisive (top-down)
- Consider all data points as a single cluster
- In each iteration, split off the data points that are not similar to the rest of their cluster
- Each separated group is treated as a new cluster; if splitting continues to the end,
  we end up with n clusters (assuming there are n data points).
- Less commonly used compared to agglomerative

## 4. Integrate data science findings
Model Uncertainty  
- Machine learning models can report a degree of confidence in their predictions  
- This provides transparency and indicates where there is room for improvement  

Cross-Validation  
- Used in statistical analysis and machine learning to evaluate model performance and reliability  
    - Help detect and reduce overfitting  
    - Provide reliable performance metric  
    - Potentially alleviate model uncertainty  
- Process:  
    1. Splitting the Data: The dataset is divided into “folds” or
    subsets. The typical structure involves k subsets, hence the
    term “k-fold cross-validation.”  
    2. Iterative Training and Testing: In k-fold cross-validation, the
    model is trained k times, each time using a different fold as
    the test set while using the remaining k-1 folds for training.  
    3. Performance Metrics: The results from each iteration are
    averaged to provide a more reliable estimate of the model’s
    performance.  
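
The three steps can be sketched in NumPy as below. The helper names `k_fold_indices` and `cross_validate` are hypothetical, and the model (a straight-line fit scored by mean squared error on synthetic data) is only a stand-in for whatever model is being evaluated.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    # Step 1: shuffle the sample indices and split them into k folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

def cross_validate(X, y, fit, score, k=5):
    scores = []
    # Step 2: train k times, each time holding out a different fold for testing.
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[test_idx], y[test_idx]))
    # Step 3: average the per-fold results.
    return float(np.mean(scores)), scores

# Synthetic regression data: y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

fit = lambda x_train, y_train: np.polyfit(x_train, y_train, 1)
score = lambda coef, x_test, y_test: float(np.mean((np.polyval(coef, x_test) - y_test) ** 2))

mean_mse, scores = cross_validate(x, y, fit, score, k=5)
print(mean_mse, scores)
```

The spread of the per-fold scores is also informative: a large spread suggests the performance estimate depends heavily on which data the model happened to see.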




