<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/Week-5-Clustering%20and%20Features/feature_extraction_introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part IV: Feature Extraction

## What is feature extraction?

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/ex1.gif' width=200px >

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/ex2.png' width=400px >

Feature extraction is the art of **simplifying our data**. Imagine you have a vast canvas with a sprawling and complex painting. Feature extraction is like being able to **reproduce that painting on a smaller canvas, while still capturing its essence and beauty**. When we explore large datasets, especially those from **signals or images**, **there's so much going on!** There are tons of details, but not all of them are crucial. Feature extraction helps us **filter out the noise and focus on the salient features**, those significant brush strokes that truly define the artwork.

**Why is this important?** Well, when we **reduce the complexity of our data**, we're setting ourselves up for success in the world of machine learning. A simplified, yet meaningful dataset is **less prone to overfitting**, where our model might get too caught up in the tiny details and miss the big picture. Furthermore, models trained on such data often **generalize better and train faster**, because there are fewer inputs to process.

So, the next time you come across a large dataset, think of it as that vast canvas. Through feature extraction, you're not only making the canvas more manageable **but ensuring that every brush stroke on it truly counts**. By emphasizing the right details and sidelining the redundant ones, we create a perfect setting for our machine learning models to thrive.




## Feature extraction examples: Brain structure volumes

Let’s now have a look at a few examples of feature extraction. You will find that feature extraction methods are often **non-trivial and application specific**. This makes sense, because extracting **salient features** requires **extensive domain knowledge**. I will now explain the main ideas behind how we created the features for the machine learning examples in this module.

One example is trying to find the **size of different parts of the brain** using MRI images from babies born too early.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract2.png' width=600px >

To extract volumes of brain structures from brain MRI of preterm babies, we first have to **mark or segment these parts** on the image. After marking them, we count **how many small picture elements (voxels)** are in each marked area. By **multiplying** this count by the **volume of a single voxel**, we find out the size of these brain parts.
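To make the voxel-counting idea concrete, here is a minimal sketch. The label map, the label value `1`, and the voxel size of 0.5 mm are made-up illustration values, not the ones used in the course data.

```python
import numpy as np

def structure_volume_mm3(label_map, label, voxel_size_mm):
    """Volume of one labelled structure = voxel count x volume of one voxel."""
    n_voxels = int(np.count_nonzero(label_map == label))
    voxel_volume = float(np.prod(voxel_size_mm))  # mm^3 per voxel
    return n_voxels * voxel_volume

# Toy 3D segmentation (hypothetical): 0 = background, 1 = structure of interest
label_map = np.zeros((10, 10, 10), dtype=int)
label_map[2:6, 2:6, 2:6] = 1  # a 4 x 4 x 4 block, i.e. 64 voxels

volume = structure_volume_mm3(label_map, label=1, voxel_size_mm=(0.5, 0.5, 0.5))
print(volume)  # 64 voxels * 0.125 mm^3 = 8.0
```

With real data, the voxel size would come from the image header rather than being typed in by hand.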

The segmentation is performed by combining two techniques. The first method **groups similar brain tissues** using what's called **Gaussian Mixture Models**. This clustering step helps separate different tissue types like the **white matter (WM), grey matter (GM), and the cerebrospinal fluid (CSF)**. The second method uses previously marked images (segmented atlases) as a guide: we propagate labels from a number of segmented atlases to our current image using deformable registration, which helps us define anatomical regions.

### Segmentation:

Segmentation is a critical step in many medical image processing tasks. It involves **partitioning an image into multiple segments, where each segment corresponds to a particular object or part of an object**.

In the context of brain imaging, segmentation is used to differentiate between **various brain structures and tissue types.**

### 1. Clustering using Gaussian Mixture Models (GMM):

**Gaussian Mixture Models (GMM)**:

- A GMM is a probabilistic model that **assumes that the data is generated from a mixture of several Gaussian distributions.**

- Each Gaussian distribution represents a cluster.

- In the context of brain image segmentation, the different clusters could represent **different tissue types.**

**Application to Brain Tissue Types**:

- The brain consists of various tissue types like White Matter (WM), Gray Matter (GM), and **Cerebrospinal Fluid (CSF)**.

- By using GMM,

 - the aim is to **separate these tissue types based on their intensity values in the image**.

 - Each tissue type would have a distinct intensity distribution, and the

 - **GMM can model these distributions to identify and cluster similar voxels (3D pixels) together**.
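As a rough sketch of intensity-based clustering, the snippet below fits a three-component GMM to synthetic 1D intensities using scikit-learn's `GaussianMixture`. The three intensity means (50, 100, 150) and their spread are invented for illustration and do not correspond to real WM, GM or CSF values. A real pipeline would work on the voxels of an actual MRI volume.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic voxel intensities from three hypothetical tissue classes
intensities = np.concatenate([
    rng.normal(50, 5, 1000),
    rng.normal(100, 5, 1000),
    rng.normal(150, 5, 1000),
])

# GaussianMixture expects a 2D array: one row per voxel, one column per feature
X = intensities.reshape(-1, 1)
gmm = GaussianMixture(n_components=3, random_state=0)
labels = gmm.fit_predict(X)          # hard cluster label for each voxel
posteriors = gmm.predict_proba(X)    # soft membership probabilities

estimated_means = np.sort(gmm.means_.ravel())
print(estimated_means)               # should be close to 50, 100 and 150
```

The soft probabilities in `posteriors` are what makes the model probabilistic: each voxel gets a degree of membership in every cluster, not just a single label.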

### 2. Propagation of Labels from Segmented Atlases:

**Segmented Atlases**:

- These are previously **segmented images that have labels for various anatomical regions**.

- Atlases serve as a **reference** for segmenting new images.

**Deformable Registration**:
- Registration is the **process of aligning two or more images**.

- Deformable registration allows for **non-linear alignment**, meaning it can account for variations in anatomy between individuals.

- In this process, the atlas is **"deformed" or "warped"** to match the anatomy of the target image.

**Label Propagation**:

- Once the atlas is aligned with the **target image** through deformable registration, the labels from the atlas (**which indicate various anatomical regions**) can be **transferred or "propagated"** to the target image.

- This helps in defining specific anatomical regions in the target image based on the reference atlas.

In conclusion, the segmentation technique described uses a combination of **probabilistic clustering through GMM to differentiate tissue types and label propagation from atlases to identify anatomical regions**. The combination ensures both the differentiation of tissue types and the precise demarcation of anatomical structures.

**NOTE**: In machine learning (ML), the term "**salient feature**" refers to a feature that **stands out** or is **most noticeable or important** in representing a particular pattern within the data. Salient features play a significant role in determining the outcome or the characteristics of the data. **Recognizing and leveraging such features can greatly enhance the performance of ML models.**

## Feature extraction examples: Cardiac function

Extraction of quantitative markers of cardiac function depends on segmentation. For this purpose, Puyol-Antón et al. used a convolutional neural network, trained on numerous annotated samples, to segment various parts of the heart in 3D+T cardiac MRI. We will discuss convolutional neural networks later in the course.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract3.png' width=600px >


From these segmentations, the volume of the left ventricle throughout the cardiac cycle is computed. The time-points with the maximum and minimum volumes, end diastole and end systole, are identified.

The ejection fraction is then calculated using the equation:

$$ EF = \frac{(LVEDV-LVESV)}{LVEDV} \times 100 $$

Summary:

- Cardiac function assessment through segmentation of 3D+T MRI images using specialized deep learning techniques.

- Parameters such as EF (ejection fraction) and GLS (global longitudinal strain) are derived from these segmentations.

- Details of left ventricle volumes include:
  - End diastolic (LVEDV)
  - End systolic (LVESV)
  
For further reading, refer to Puyol-Antón et al., "Regional multi-view learning for cardiac motion analysis: Application to identification of dilated cardiomyopathy patients", published in IEEE TBME, 2018.

**NOTE: Annotated samples**

Annotated samples refer to data items (in this context, 3D+T cardiac MRI images) that have been **labeled or marked by experts or annotators** to indicate specific features or structures of interest. In the case of cardiac MRI, annotations might involve delineating different parts of the heart, such as the chambers, valves, or myocardium. These annotations provide ground truth information, which the convolutional neural network (CNN) can use during training to learn how to correctly segment these structures on new, unseen images. The more accurate and diverse the annotated samples are, the better the CNN will be at segmenting similar structures in new images.

## Formula for calculating ejection fraction (EF)

Picture this: You're watching a super detailed movie of your heart in action. This isn't just any regular film; it's captured using a special camera known as a **3D+T MRI**. This camera allows us to see every tiny detail of your heart from **all angles and across time**.

Now, these pictures can be a bit overwhelming. Every single detail of your heart is captured in these images. But, what if we just want to understand **how your heart pumps blood**? That's where "feature extraction" waltzes in like a detective, looking for clues about your heart's performance. It's like skimming a book to understand the main plot without getting bogged down by all the side stories.

Think of the left ventricle as a water balloon that fills up and then squeezes. Two moments capture our attention: when the balloon is **fullest** (**end diastole**) and when it's most **squeezed out** (**end systole**). With the volumes at these two moments, we can calculate something called the **"ejection fraction" (EF)**. It's like finding out what share of its water the balloon pushes out in one go!

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/ejection.png' width=400px >

[source](https://www.nursingcenter.com/ncblog/august-2021/how-to-calculate-ejection-fraction)

The formula for calculating the ejection fraction (EF) is given by:

$$
EF = \left( \frac{LVEDV - LVESV}{LVEDV} \right) \times 100
$$

Here, we're just looking at the **difference between the balloon's maximum and minimum water levels as a percentage**. Don't worry about the letters too much; they're just fancy names for the balloon's maximum and minimum water levels.

Here's a breakdown of the terms:

- **EF (Ejection Fraction)**: This represents the percentage of blood that's pumped out of the left ventricle (a main pumping chamber of the heart) during each heartbeat.

- **LVEDV (Left Ventricular End-Diastolic Volume)**: This is the amount of blood in the left ventricle just before it contracts to pump the blood out. Essentially, it's the volume of the left ventricle when it's fullest.

- **LVESV (Left Ventricular End-Systolic Volume)**: This is the amount of blood remaining in the left ventricle after it has contracted. It's the volume of the left ventricle when it's least full.

The formula calculates the difference between the volume of blood in the left ventricle before and after it contracts. This difference is then divided by the full volume (before contraction) to get a fraction. Multiplying by 100 converts this fraction into a percentage, giving us the ejection fraction. This percentage helps in understanding how effectively the heart is pumping blood out during each beat.
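The calculation can be sketched in a few lines. The volume curve below is a made-up example of left ventricle volumes (in ml) over one cardiac cycle. In practice these values would come from the segmentations at each time point.

```python
import numpy as np

def ejection_fraction(lv_volumes_ml):
    """Return EF (%) plus the end-diastolic and end-systolic time indices."""
    volumes = np.asarray(lv_volumes_ml, dtype=float)
    ed_index = int(np.argmax(volumes))   # fullest: end diastole
    es_index = int(np.argmin(volumes))   # emptiest: end systole
    lvedv, lvesv = volumes[ed_index], volumes[es_index]
    ef = (lvedv - lvesv) / lvedv * 100.0
    return ef, ed_index, es_index

# Hypothetical LV volume at eight time points of one heartbeat
lv_volumes_ml = [120, 110, 85, 60, 50, 65, 90, 115]

ef, ed_index, es_index = ejection_fraction(lv_volumes_ml)
print(round(ef, 1))  # (120 - 50) / 120 * 100 = 58.3
```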

### Feature extraction examples: QRS interval from ECG

Our third way to measure heart function is called the QRS interval. Instead of using MRI, we get this from an ECG.

To derive the QRS interval from an ECG signal, we start with processing the **raw ECG**.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract44.png' width=700px >


The first step involves **filtering the signal** to discard low-frequency drifts and any external noise. This gives us a cleaner waveform to work with.

Next, we determine the peaks, which appear as **local highs and lows (maxima and minima)** within the filtered signal.

To make sure we get the right points, we set limits on how close or how intense these points can be. By setting **certain thresholds**, like a **minimum height and distance between the peaks**, we can accurately pinpoint the Q, R, and S peaks.

After these peaks are detected, the next step is to compute the **average duration of the QRS interval** from the data gathered over the course of the ECG recording.
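The steps above can be sketched with SciPy on a synthetic signal. Everything specific here is an assumption for illustration: the sampling rate, the shape of the fake beats, the 0.5 Hz high-pass cutoff, the thresholds, and the 80 ms search window around each R peak. The sketch also measures the time from the Q trough to the S trough, which is a simplification of how QRS duration is measured clinically.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

fs = 500                          # sampling rate in Hz (assumed)
t = np.arange(10 * fs) / fs       # 10 seconds of signal

def bump(center, amplitude, width=0.01):
    """Gaussian-shaped deflection used to fake the Q, R and S waves."""
    return amplitude * np.exp(-0.5 * ((t - center) / width) ** 2)

# Synthetic "raw ECG": slow baseline drift plus one beat per second
raw = 0.3 * np.sin(2 * np.pi * 0.2 * t)
for beat in np.arange(0.5, 10, 1.0):
    raw += bump(beat - 0.04, -0.15) + bump(beat, 1.0) + bump(beat + 0.04, -0.25)

# Step 1: filter out the low-frequency drift (zero-phase high-pass)
b, a = butter(2, 0.5, btype="highpass", fs=fs)
filtered = filtfilt(b, a, raw)

# Step 2: R peaks are maxima above a minimum height and a minimum distance apart
r_peaks, _ = find_peaks(filtered, height=0.5, distance=int(0.3 * fs))

# Step 3: Q and S are the lowest points just before and just after each R peak
window = int(round(0.08 * fs))
durations = []
for r in r_peaks:
    q = r - window + np.argmin(filtered[r - window:r])
    s = r + np.argmin(filtered[r:r + window])
    durations.append((s - q) / fs)

# Step 4: average over the whole recording
mean_qrs_s = float(np.mean(durations))
print(len(r_peaks), round(mean_qrs_s, 3))
```

On real recordings the thresholds need tuning, and beats too close to the start or end of the signal would need to be skipped so the search window stays inside the array.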



### **Types of features: Feature Categories**

Let’s now have a look at various types of features that we can use to train our machine learning models. Some features are directly observed or measured, and no feature extraction is needed. These are for example age, gender, height, or blood glucose levels.

1. **Direct Observations:**
   - Examples include age, gender, height, blood glucose levels, and others. These are straightforward measurements that don't need any additional processing.

2. **Morphological Features:**
   - These focus on the shape and structure of objects. Common ones are length, area, and volume. Think about measurements like the size of brain structures or the duration of QRS waves.

3. **Texture Descriptions:**
   - These dive into the statistical details of images or parts of images. Key terms here are mean, variance, kurtosis, and entropy. For instance, studying the texture of breast tissue samples can help detect cancer.

4. **Transform-based Attributes:**
   - This involves changing signals or images into different formats or domains. One example is moving to the frequency domain, which can help pinpoint abnormalities in things like ECG readings.

5. **Local Feature Descriptions:**
   - These emphasize specific points in images, like edges, blobs, or corners. They are typically highlighted using filtering techniques.

In summary, there's a wide variety of features that assist in training machine learning models. Some are direct and easy to see, while others require a deeper look or transformation to understand fully.
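As a small illustration of a transform-based feature, the sketch below moves a signal into the frequency domain with NumPy's FFT and reads off the dominant frequency. The signal is synthetic: a 1.25 Hz component plus a weaker 50 Hz component, sampled at an assumed 250 Hz.

```python
import numpy as np

fs = 250                              # sampling rate in Hz (assumed)
t = np.arange(4 * fs) / fs            # 4 seconds, so 0.25 Hz frequency resolution

# Synthetic signal: strong slow rhythm plus weaker 50 Hz interference
signal = np.sin(2 * np.pi * 1.25 * t) + 0.3 * np.sin(2 * np.pi * 50 * t)

spectrum = np.abs(np.fft.rfft(signal))           # magnitude spectrum
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)   # frequency of each bin in Hz

# Skip the zero-frequency bin and pick the strongest remaining component
dominant_freq = freqs[1:][np.argmax(spectrum[1:])]
print(dominant_freq)  # 1.25
```

The dominant frequency is a single number summarizing the whole signal, which is exactly the kind of compact feature a model can be trained on.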

### Feature extraction example: Breast Cancer Classification Using Histological Images

We will be exploring the PatchCamelyon dataset, which includes histological images of breast tissues. These images showcase both healthy and cancerous cells. Our goal is to determine if texture descriptors, specifically GLCM statistical properties, or localized feature descriptors like DAISY can be indicators of cancer.

**Breast Cancer Classification Using Histological Images**

- **Dataset:** PatchCamelyon

- **Feature Extraction Techniques:**


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract6.png' width=500px >


  - Texture Descriptors (GLCM)

  - Localised Feature Descriptors (DAISY)

In the upcoming tutorial, you'll have the opportunity to do this yourself.

**Jupyter Notebook:** 8.4 Feature Extraction

**NOTE: Histological images** are visual representations of tissue sections that have been prepared and stained to be observed under a microscope. The study of the microscopic anatomy of tissues is called histology. The main purpose of histological imaging is to diagnose diseases, study the anatomy and structure of tissues, and for various research purposes.

### Texture descriptors: Gray level co-occurrence matrix (GLCM)

Gray level co-occurrence matrix expresses how pairs of discretised intensities, or grey levels, of neighbouring pixels or voxels, are distributed along one of the image directions. In 2D, we calculate the matrix in x and y directions separately.

A gray level co-occurrence matrix is like a table that shows **how often certain pairs of colors or shades are next to each other in an image**. In a 2D picture, we look at pairs in left-right (x) and up-down (y) directions.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract7.png' width=500px >

Imagine you have a chart for the left-right direction. The chart shows we have one pair where the first color is 1 and the next color is 2. So, in our table, where we have row 1 and column 2, we put the number 1.

Now, for the up-down direction, we have another chart. Here, we see the pair (1,2) twice. So, in the same table, for row 1 and column 2, we put the number 2.
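Counting the pairs can be written as a short loop. The 3x3 image below is a hypothetical toy example with grey levels 0, 1 and 2. It is not the image from the figure, so the counts differ from the ones described above.

```python
import numpy as np

def glcm(image, dx, dy, levels):
    """Count how often grey level i has grey level j as its neighbour
    at offset (dx, dy). Row index = first pixel, column index = neighbour."""
    image = np.asarray(image)
    counts = np.zeros((levels, levels), dtype=int)
    rows, cols = image.shape
    for r in range(rows - dy):
        for c in range(cols - dx):
            counts[image[r, c], image[r + dy, c + dx]] += 1
    return counts

# Hypothetical image with three grey levels
image = [[0, 0, 1],
         [1, 2, 2],
         [0, 1, 2]]

glcm_x = glcm(image, dx=1, dy=0, levels=3)  # left-right neighbours
glcm_y = glcm(image, dx=0, dy=1, levels=3)  # up-down neighbours
print(glcm_x)
print(glcm_y)
```

For instance, the pair (0, 1) appears twice in the left-right direction (rows one and three), so `glcm_x[0, 1]` is 2.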

**NOTE: Discretized intensities** refer to **the process of converting a continuous range of pixel or voxel values (intensities) in an image into a limited number of discrete levels or bins.** This is often done to simplify the image data or to reduce the amount of data for analysis.

For example, consider an image where pixel intensities can vary continuously between 0 and 255. If this image is discretized into 8 levels, then the intensity range might be divided into 8 equally spaced bins, such as:
- 0-31
- 32-63
- 64-95
- ...
- 224-255

In this discretized version of the image, any pixel with an intensity between 0 and 31 would be assigned a new intensity value (perhaps the midpoint, 15), any pixel with an intensity between 32 and 63 would be assigned another value (perhaps 47), and so on.

Discretization is commonly used in methods like the Gray Level Co-occurrence Matrix (GLCM) to reduce the size of the matrix and to make computations more manageable. It's important to choose an appropriate number of discrete levels based on the image characteristics and the specific analysis task.
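For the 8-level example above, the binning amounts to an integer division. This is a minimal sketch that assigns each pixel its bin index (0 to 7) rather than a bin midpoint, which is the form a co-occurrence matrix needs. The pixel values are arbitrary.

```python
import numpy as np

pixels = np.array([0, 31, 32, 100, 200, 255])  # example 8-bit intensities
n_levels = 8
bin_width = 256 // n_levels                     # 32 intensities per bin

levels = pixels // bin_width                    # bin index for each pixel
print(levels)  # [0 0 1 3 6 7]
```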

**The term "gray" in "Gray Level Co-occurrence Matrix" (GLCM)** refers to grayscale images. Grayscale images are those that have only shades of gray and no color. In such images, each pixel's value represents a shade of gray, ranging from black to white.

Unlike color images, which typically have three channels (red, green, blue) to represent color information, grayscale images have only one channel. The intensity value of each pixel in this channel can range from, for example, 0 (representing black) to 255 (representing white) in an 8-bit image.

GLCM is often used in the context of grayscale images to capture texture information by examining the spatial relationship of pixel intensities. **The word "gray" emphasizes that the method is primarily focused on variations in intensity rather than color.**

## Texture Description: Texture Descriptors

Once we have recorded the frequencies in the matrix, we normalise it by the total number of discretised grey level intensity pairs. This way, we convert the matrix to a 2D probability distribution. We can then calculate statistical properties of this distribution as our texture descriptors. In particular, we calculate contrast, dissimilarity, homogeneity, energy and correlation.

The Gray level co-occurrence matrix, or GLCM for short, is a way to understand the texture of an image. Think of it like a table that shows how often pairs of pixels with specific brightness values are found together in an image.

To make sense of the numbers in this table, we change them into proportions or probabilities. This makes it a 2D probability distribution, represented as $p(i,j)$, where $i$ and $j$ run over the $l$ discretised grey levels.

Now, from this table, we can figure out some cool things about the texture of our image:

1. **Contrast**: Measures the amount of local variations present. It's calculated using the formula:
$$ \sum_{i,j=0}^{l-1} (i-j)^2 p(i,j) $$

2. **Dissimilarity**: Gives us an idea about how different pairs of pixels are. The formula is:
$$ \sum_{i,j=0}^{l-1} |i-j| p(i,j) $$

3. **Homogeneity**: Tells us how close the elements of the matrix are to the matrix diagonal. In simpler words, how uniform our texture is. It's found using:
$$ \sum_{i,j=0}^{l-1} \frac{1}{1+(i-j)^2} p(i,j) $$

4. **Energy**: It's like a measure of uniformity or orderliness of the pixels. The higher the energy, the more order or regularity there is in the image. Formula:
$$ \sqrt{\sum_{i,j=0}^{l-1} p(i,j)^2} $$

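5. **Correlation**: Shows how a pixel is related to its neighbor. The math behind it is a bit complex, but it's done using:
$$ \sum_{i,j=0}^{l-1} \frac{(i-\mu_i)(j-\mu_j)}{\sigma_i \sigma_j} p(i,j) $$
Here $\mu_i, \mu_j$ and $\sigma_i, \sigma_j$ are the means and standard deviations of the row and column grey levels under $p(i,j)$.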

In simple words, once we've got our matrix set up and converted to percentages, these formulas help us describe the texture of our image in different ways!
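The five formulas translate almost directly into NumPy. The sketch below is a plain implementation of the formulas as written above, applied to a small hypothetical count matrix. It is not the code used in the tutorial.

```python
import numpy as np

def glcm_descriptors(counts):
    """Texture descriptors from a matrix of grey level pair counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                  # normalise to probabilities p(i, j)
    n_levels = p.shape[0]
    i, j = np.indices((n_levels, n_levels))    # grids of row and column grey levels

    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))

    return {
        "contrast": np.sum((i - j) ** 2 * p),
        "dissimilarity": np.sum(np.abs(i - j) * p),
        "homogeneity": np.sum(p / (1 + (i - j) ** 2)),
        "energy": np.sqrt(np.sum(p ** 2)),
        "correlation": np.sum((i - mu_i) * (j - mu_j) * p) / (sigma_i * sigma_j),
    }

# Hypothetical pair counts for an image with three grey levels
counts = [[1, 2, 0],
          [0, 0, 2],
          [0, 0, 1]]

descriptors = glcm_descriptors(counts)
for name, value in descriptors.items():
    print(f"{name}: {value:.3f}")
```

Note that the correlation is undefined for a perfectly flat image, because both standard deviations are then zero.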

### Texture description: Classification of breast cancer from histological images

Here we're comparing pictures of regular tissue and cancer tissue. It's clear there's a difference in GLCM patterns between them. In our tutorial, we'll dig deeper to find out which stats work best to spot the cancerous parts.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract8.png' width=700px >

About the PatchCamelyon dataset:
- We get the GLCM from shaded patches.
- We then look at the contrast, dissimilarity, homogeneity, energy, and correlation of each patch to figure out if it has cancer tissue in it.

## Localised feature descriptors: The DAISY descriptor

The main goal of the DAISY descriptor is to gather dense features from across an image, rather than just pinpointing standout points. It's all about understanding how colors or shades change in different directions. To do this, we first create maps of these changes using specific techniques. Afterward, we blur these maps using different-sized filters and then pick information from set spots on them.


<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-5-Clustering%20and%20Features/imgs/extract9.png' width=400px >


The picture above illustrates how the DAISY descriptor works. There's a central red spot marking the DAISY's position on the image. There are other red and blue spots indicating where we're getting our information. Circles on this picture signify the blurring filters we use; bigger circles mean more blurring. Lines on the image show the direction of the color changes.

Once we've got all this info, we organize it into a neat list or vector for later use. In our upcoming tutorial, we'll be using the DAISY descriptor to help identify types of breast cancer in samples and see how it stacks up against GLCM.


We will have a closer look at the DAISY descriptor, which aims to extract dense features, rather than focus on salient points. DAISY descriptor collects information about gradients in various orientations in the image. We first calculate the gradient maps in different orientations using filtering techniques. Then we blur the gradient maps with Gaussian filters of different sizes and sample the gradient maps at pre-defined locations.

Here is the schematic image of the DAISY descriptor. The red dot is the location of the DAISY descriptor in the image. The red and blue dots show the locations, where we sample the gradient maps. The circles represent the Gaussian kernels used for blurring in each location. The radius of each circle is equal to the standard deviation of the Gaussian kernel.  The orientations of gradients are shown here by the lines plotted over the sampling locations.

Finally, the features extracted for each DAISY descriptor are arranged in a vector ready for further processing. In the tutorial we will apply DAISY descriptor to classify breast cancer histological samples and compare it to GLCM.

A quick recap:

- Calculate gradient maps in different orientations (Measure the changes in colors in different directions)

- Blur gradient maps with Gaussian filters of different sizes (Soften these changes using Gaussian filters of different sizes)

- Sample blurred gradient maps at selected locations (Take samples from the softened maps at certain spots)

- Arrange the extracted features into a vector (Organize the gathered features into a line of data)
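The four steps can be sketched as a heavily simplified, DAISY-like descriptor for a single location. This is not the real DAISY algorithm: the function name `daisy_like_descriptor`, the number of orientations, the ring radii and the sigmas are all illustrative assumptions. For actual work, scikit-image offers a full implementation as `skimage.feature.daisy`.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def daisy_like_descriptor(image, center, n_orientations=4,
                          sigmas=(1.0, 2.0), ring_radii=(3, 6), n_points=4):
    """Simplified DAISY-style feature vector at one image location."""
    image = np.asarray(image, dtype=float)
    cy, cx = center

    # Step 1: gradient maps in different orientations
    # (positive part of the directional derivative for each angle)
    gy, gx = np.gradient(image)
    angles = 2 * np.pi * np.arange(n_orientations) / n_orientations
    maps = [np.maximum(gx * np.cos(a) + gy * np.sin(a), 0) for a in angles]

    # Step 2: blur every map once per ring, with a larger sigma for outer rings
    blurred_per_ring = [[gaussian_filter(m, s) for m in maps] for s in sigmas]

    # Step 3: sample at the center, then at points placed on each ring
    features = [b[cy, cx] for b in blurred_per_ring[0]]
    for blurred, radius in zip(blurred_per_ring, ring_radii):
        for k in range(n_points):
            phi = 2 * np.pi * k / n_points
            y = int(round(cy + radius * np.sin(phi)))
            x = int(round(cx + radius * np.cos(phi)))
            features.extend(b[y, x] for b in blurred)

    # Step 4: arrange everything into one vector
    return np.array(features)

rng = np.random.default_rng(0)
patch = rng.random((32, 32))              # stand-in for a grayscale image patch
descriptor = daisy_like_descriptor(patch, center=(16, 16))
print(descriptor.shape)  # (1 + 2 rings * 4 points) * 4 orientations = (36,)
```

The vector length depends only on the layout, so descriptors from different locations or images can be compared directly or fed to a classifier.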


## Feature extraction in the era of deep learning

**Feature Engineering:**
This is like giving our computer a little manual on how to look at data. We're telling it, "Hey, check out these particular bits and pieces, they're super important!" In our chat today, I'll show you some cool techniques we've got up our sleeve for this.

**Feature Learning:**
Here's where things get futuristic! With deep neural networks, we basically let our computer play detective. Instead of us pointing out the clues, it finds them on its own. It's especially awesome for looking at things like pictures or sounds where there's a ton of info to sift through.

**Why Still Love Feature Engineering?**
Even with all the fancy auto-detective work, there's value in the old-school approach:

- We get to be the boss! We decide what features the computer should focus on.

- It's way easier to explain. If someone asks, "Why did the computer decide this?", we've got clear answers.

- It's super handy when we don't have a mountain of data, or when we really want to understand what's driving our computer's decisions.

In a nutshell, while letting the computer learn on its own is super powerful, there's still a special place in our hearts (and in many projects) for good old feature engineering!