# Computer Vision Homework 02 : Image Compression with Frequency Domain Techniques and Deep Learning

Contact: David C. Schedl (david.schedl@fh-hagenberg.at)

Note: this is the starter pack for the **Visual Computing** homework. You do not need to use this template!

This notebook combines concepts from previous exercises, focusing on image compression using both classical frequency-domain methods and modern deep learning approaches. We will be using the CelebA dataset for our experiments.

## Task Overview:
The goal is to implement and compare different image compression algorithms.

Choose at least two of the following three options:
1.  **Frequency Domain Compression:** Exploit the frequency domain (DFT/DCT) to compress images. Analyze how removing frequency coefficients affects image quality and compression ratio.
2.  **Deep Learning Compression:** Implement and train autoencoders (both MLP-based and convolutional) to learn compressed representations of images. Investigate the trade-off between latent dimension size and reconstruction quality.
3.  **Implicit Neural Representations:** Consider how INRs could be used for image representation and compression.

Analyze your algorithms on the CelebA dataset. Evaluate results concerning quality (e.g., MSE) and size reduction. Consider how parameters can be tuned for different quality/size trade-offs.

*Hint:* Work with resized images (e.g., 32x32, 64x64 or 128x128) and potentially smaller network architectures to speed up computation and manage storage requirements, especially during initial exploration.

**Have fun!** 😸

In [None]:
# %%capture
# Uncomment the %%capture line above to suppress the output of pip install
%pip install tensorflow_datasets
%pip install tensorflow
%pip install bs4


import numpy as np
import matplotlib.pyplot as plt
import datetime
import tensorflow as tf
import tensorflow_datasets as tfds
import cv2 # OpenCV for DFT/DCT

from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
import tensorflow.keras.losses as losses
from tensorflow.keras.layers import (Dense, Flatten, Conv2D, MaxPooling2D,
                                     Conv2DTranspose, Reshape, Input)


def MSE(A,B):
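  """Compute the mean squared error (MSE) between NumPy arrays A and B.

  Inputs whose maximum exceeds 1.0 are assumed to be in the 0-255 range and are
  rescaled to [0, 1] before the error is computed.
  """
  # Cast to float first so that uint8 inputs cannot wrap around on subtraction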
  A = A.astype(np.float32)
  B = B.astype(np.float32)
  if A.max() > 1.0: A = A / 255.0
  if B.max() > 1.0: B = B / 255.0
  return ((A - B)**2).mean(axis=None)

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


## Dataset: CelebA (CelebFaces Attributes Dataset)

CelebA is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. We will use these images for our compression experiments. We'll resize them to a smaller dimension (e.g., 32x32 or 64x64) for manageability.

We will use `tensorflow_datasets` to load and preprocess the data.

In [None]:
IMG_HEIGHT = 32
IMG_WIDTH = 32
IMG_CHANNELS = 3
IMAGE_SHAPE = (IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS)
BATCH_SIZE = 64 # Adjusted for potentially larger dataset

def preprocess_celeba(features):
    image = features['image']
    image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH])
    # Note: CelebA images are not square (218x178 in TFDS), so resizing directly to
    # IMG_HEIGHT x IMG_WIDTH slightly distorts the aspect ratio. Cropping before
    # resizing is common practice for CelebA; for simplicity, we accept the distortion here.
    # image = tf.image.central_crop(image, central_fraction=0.8) # Example if needed
    # image = tf.image.resize(image, [IMG_HEIGHT, IMG_WIDTH]) # Resize again after crop
    image = tf.cast(image, tf.float32) / 255.0
    return image, image # For autoencoders, input and target are the same

# Load CelebA dataset
ds_builder = tfds.builder('celeb_a')
ds_builder.download_and_prepare()

ds_train = ds_builder.as_dataset(split='train', shuffle_files=True)
ds_train = ds_train.map(preprocess_celeba, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

ds_test = ds_builder.as_dataset(split='test', shuffle_files=False) # No need to shuffle test
ds_test = ds_test.map(preprocess_celeba, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

print(f"Training dataset: {ds_train}")
print(f"Test dataset: {ds_test}")

# Display some sample images
plt.figure(figsize=(10, 5))
for i, (image, _) in enumerate(ds_train.take(1).unbatch().take(5)):
    ax = plt.subplot(1, 5, i + 1)
    plt.imshow(image.numpy())
    plt.title(f"Sample {i+1}")
    plt.axis("off")
plt.suptitle("Sample Training Images from CelebA (Resized)")
plt.show()

# Get a single image for DFT/DCT examples
sample_celeba_image_rgb_normalized = next(iter(ds_train.take(1).unbatch()))[0].numpy()
sample_celeba_image_rgb_0_255 = (sample_celeba_image_rgb_normalized * 255).astype(np.uint8)


# Option 1: Frequency Domain Compression

In this part, we'll explore using the Discrete Fourier Transform (DFT) and Discrete Cosine Transform (DCT) to compress images. The core idea is that many images have significant information concentrated in lower frequencies. By transforming an image into its frequency components, we can discard or quantize less important (often high-frequency) components to achieve compression.
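
The claim that image information is concentrated in low frequencies can be checked numerically. The following is a minimal sketch using only NumPy's FFT; the helper `low_frequency_energy_share` and the synthetic Gaussian test image are illustrative assumptions, not part of the starter code. Any grayscale array (for example a CelebA image) can be passed instead.

```python
import numpy as np

def low_frequency_energy_share(img, radius):
    """Fraction of spectral energy within `radius` bins of the zero frequency."""
    spectrum = np.fft.fftshift(np.fft.fft2(img.astype(np.float64)))
    energy = np.abs(spectrum) ** 2
    h, w = img.shape
    cy, cx = h // 2, w // 2  # location of the zero frequency after fftshift
    low = energy[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1]
    return low.sum() / energy.sum()

# Synthetic smooth image (an assumption for illustration): a Gaussian blob
yy, xx = np.mgrid[0:32, 0:32]
smooth = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 6.0 ** 2))

share = low_frequency_energy_share(smooth, radius=4)
print(f"Share of energy within 4 bins of the zero frequency: {share:.4f}")
```

Natural images are less smooth than this blob, so expect a lower (but usually still large) share when running it on real data.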

### Frequencies: DFT and DCT

The Discrete Fourier Transform (DFT) and the Discrete Cosine Transform (DCT) decompose a signal, such as an image, into its constituent frequencies. The DFT of a real-valued image is complex-valued; OpenCV returns it as a two-channel array holding the real and imaginary parts, from which the magnitude can be computed. The DCT, in contrast, directly produces real-valued coefficients, which can be easier to work with for certain compression schemes.

Below are examples using OpenCV's DFT and DCT implementations on a sample CelebA image.

In [None]:
# Convert the sample CelebA image to grayscale for DFT/DCT
gray_celeba = cv2.cvtColor(sample_celeba_image_rgb_0_255, cv2.COLOR_RGB2GRAY)

# DFT Example
# Transform the image to frequency domain using DFT
# OpenCV's DFT output is a 2-channel array (real and imaginary parts)
dft_input = gray_celeba.astype(np.float32) # cv2.dft requires floating-point input
dft_output = cv2.dft(dft_input, flags=cv2.DFT_COMPLEX_OUTPUT)

# Shift the zero-frequency component to the center for visualization.
# Restrict the shift to the two spatial axes; shifting the channel axis as well
# would swap the real and imaginary parts.
dft_shift = np.fft.fftshift(dft_output, axes=(0, 1))

# Calculate magnitude spectrum on a log scale: 20 * log(1 + sqrt(real^2 + imag^2))
magnitude_spectrum = 20 * np.log(cv2.magnitude(dft_shift[:,:,0], dft_shift[:,:,1]) + 1)

# Inverse DFT
# First, shift back (again only along the spatial axes)
f_ishift = np.fft.ifftshift(dft_shift, axes=(0, 1))
# Then, inverse DFT
img_back_dft = cv2.idft(f_ishift, flags=cv2.DFT_SCALE | cv2.DFT_REAL_OUTPUT) # Scale and get real output
img_back_dft = np.clip(img_back_dft, 0, 255).astype(np.uint8)

plt.figure(figsize=(18, 6))
plt.subplot(131), plt.imshow(gray_celeba, cmap='gray'), plt.title('Input Grayscale Image')
plt.subplot(132), plt.imshow(magnitude_spectrum, cmap='gray'), plt.title('DFT Magnitude Spectrum')
plt.subplot(133), plt.imshow(img_back_dft, cmap='gray'), plt.title('Image after IDFT')
plt.show()

print(f'MSE between original and DFT reconstructed: {MSE(gray_celeba, img_back_dft)}')


In [None]:
# DCT Example
dct_input = gray_celeba.astype(np.float32) # DCT also needs float input

# Transform the image to frequency domain using DCT
dct_output = cv2.dct(dct_input)
log_dct_spectrum = np.log(np.abs(dct_output) + 1) # Log scale for visualization

# Inverse DCT
img_back_dct = cv2.idct(dct_output)
img_back_dct = np.clip(img_back_dct, 0, 255).astype(np.uint8)

plt.figure(figsize=(18, 6))
plt.subplot(131), plt.imshow(gray_celeba, cmap='gray'), plt.title('Input Grayscale Image')
plt.subplot(132), plt.imshow(log_dct_spectrum, cmap='gray'), plt.title('DCT Spectrum (Log Scaled)')
plt.subplot(133), plt.imshow(img_back_dct, cmap='gray'), plt.title('Image after IDCT')
plt.show()

print(f'MSE between original and DCT reconstructed: {MSE(gray_celeba, img_back_dct)}')


### Your Task (Frequency Domain Compression):
1.  **Implement Compression:** Modify the DFT or DCT process. After transforming the image to the frequency domain, either set a chosen percentage of the smallest-magnitude coefficients to zero, or zero all coefficients outside a low-frequency region (e.g., the top-left corner for the DCT, the center for the shifted DFT).
2.  **Analyze:** How does varying the number of removed/zeroed coefficients affect the reconstructed image quality (visual and MSE)?
3.  **Evaluate:** Estimate the compression ratio. For example, if you keep only K out of N coefficients, how does this translate to potential bit savings (e.g., by only storing non-zero coefficients and their positions, or using run-length encoding on zeroed coefficients)?
4.  **Parameterize:** Can you make the image quality / size reduction a parameter of your compression algorithm (e.g., a quality factor or a percentage of coefficients to keep)?
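
As a starting point for these tasks, here is a minimal sketch of magnitude-based coefficient truncation. To keep it self-contained it builds an orthonormal DCT-II matrix in NumPy instead of calling OpenCV; the same masking step can be applied to the output of `cv2.dct`. The names `dct_matrix`, `compress_dct`, and `keep_fraction` are hypothetical, and the random test image is only a placeholder for a grayscale CelebA image.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of size n x n, so that C @ C.T is the identity."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # the constant (DC) row has a different scale
    return c

def compress_dct(img, keep_fraction):
    """Keep only the largest-magnitude DCT coefficients and reconstruct the image."""
    h, w = img.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ img.astype(np.float64) @ cw.T       # forward 2D DCT
    k = max(1, int(round(keep_fraction * coeffs.size)))
    threshold = np.sort(np.abs(coeffs).ravel())[-k]   # k-th largest magnitude
    mask = np.abs(coeffs) >= threshold
    recon = ch.T @ (coeffs * mask) @ cw               # inverse 2D DCT
    return recon, int(mask.sum())

# Placeholder input (assumption): replace with a grayscale image scaled to [0, 1]
image = np.random.default_rng(0).random((32, 32))
for frac in (1.0, 0.5, 0.1):
    recon, n_kept = compress_dct(image, frac)
    mse = np.mean((image - recon) ** 2)
    print(f"keep {frac:.0%}: {n_kept} coefficients, MSE = {mse:.5f}")
```

Here `keep_fraction` acts as the quality parameter asked for in step 4. How many bits the kept coefficients and their positions actually cost is left open on purpose, since that is part of the evaluation in step 3.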

# Option 2: Autoencoders

An autoencoder is a neural network that learns efficient codings of unlabeled data. It consists of two parts: an **encoder** that maps the input to a lower-dimensional latent representation, and a **decoder** that reconstructs the input from this latent representation. If the latent representation is significantly smaller than the input data, the autoencoder effectively performs compression.

We will implement and train autoencoders on the CelebA dataset.

### Example AutoEncoder with Multi-Layer Perceptrons (Dense layers)

This is a simple autoencoder using dense layers. The encoder flattens the image and maps it to a latent vector, and the decoder maps this vector back to the image shape.

In [None]:
class AutoencoderMLP(Model):
  def __init__(self, latent_dim, image_shape):
    super(AutoencoderMLP, self).__init__()
    self.latent_dim = latent_dim
    self.image_shape = image_shape
    self.encoder = tf.keras.Sequential([
      Flatten(),
      Dense(latent_dim * 4, activation='relu'), # Intermediate layer
      Dense(latent_dim, activation='relu'),
    ])
    self.decoder = tf.keras.Sequential([
      Dense(latent_dim * 4, activation='relu'), # Intermediate layer
      Dense(np.prod(self.image_shape), activation='sigmoid'), # Sigmoid for [0,1] output
      Reshape(self.image_shape)
    ])

  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded

latent_dim_mlp = 32
autoencoder_mlp = AutoencoderMLP(latent_dim_mlp, IMAGE_SHAPE)
autoencoder_mlp.compile(optimizer='adam', loss=losses.MeanSquaredError())

# Build the model by passing a sample batch shape
autoencoder_mlp(tf.ones((1,) + IMAGE_SHAPE))
autoencoder_mlp.summary()

### Training the MLP Autoencoder

In [None]:
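epochs_mlp = 10 # Adjust as needed, CelebA might need more epochs

# TensorBoard callback (the log directory name is only a suggestion)
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

print(f"Training MLP Autoencoder for {epochs_mlp} epochs...")
history_mlp = autoencoder_mlp.fit(ds_train,
                                epochs=epochs_mlp,
                                shuffle=True, # Ignored for tf.data inputs; ds_train only shuffles files (shuffle_files=True)
                                validation_data=ds_test,
                                callbacks=[tensorboard_callback],
                                verbose=1)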

### Encoding / Decoding with MLP Autoencoder

In [None]:
print("Encoding and decoding test images with MLP Autoencoder...")
# Get a batch of test images for display
test_images_batch, _ = next(iter(ds_test))

encoded_imgs_mlp = autoencoder_mlp.encoder(test_images_batch).numpy()
decoded_imgs_mlp = autoencoder_mlp.decoder(encoded_imgs_mlp).numpy()

print(f"Shape of encoded images (MLP): {encoded_imgs_mlp.shape}")

n = 5 # Number of images to display
plt.figure(figsize=(20, 8))
for i in range(n):
  # Display original
  ax = plt.subplot(2, n, i + 1)
  plt.imshow(test_images_batch[i].numpy())
  plt.title("original")
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)

  # Calculate MSE for this image
  mse_val = MSE(test_images_batch[i].numpy(), decoded_imgs_mlp[i])

  # Display reconstruction
  ax = plt.subplot(2, n, i + 1 + n)
  plt.imshow(decoded_imgs_mlp[i])
  plt.title(f"MSE:{mse_val:.2E}")
  ax.get_xaxis().set_visible(False)
  ax.get_yaxis().set_visible(False)
plt.suptitle("MLP Autoencoder: Original vs. Reconstructed")
plt.show()

### Example AutoEncoders with Convolutional Layers

Convolutional Autoencoders (CAEs) typically perform better for image data as they preserve spatial structures. The encoder uses convolutional layers to downsample the image, and the decoder uses transposed convolutional layers to upsample it back.

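**Note on `IMAGE_SHAPE`**: These models use `IMAGE_SHAPE` as defined during CelebA loading, which is `(32, 32, 3)` with the default settings above. Each stride-2 convolution halves the height and width, so both should be divisible by 8.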
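
The feature-map sizes in the convolutional models depend on the input size. With `padding='same'`, a convolution with stride `s` maps a spatial size `n` to `ceil(n / s)`. The small helper below (the name `encoder_shapes` is hypothetical) traces the shapes through three stride-2 layers with 16, 32, and 64 filters, matching the encoders in this section.

```python
import math

def encoder_shapes(image_shape, filters=(16, 32, 64), stride=2):
    """Trace (height, width, channels) through strided 'same'-padded convolutions."""
    h, w, c = image_shape
    shapes = [(h, w, c)]
    for f in filters:
        h, w = math.ceil(h / stride), math.ceil(w / stride)
        shapes.append((h, w, f))
    return shapes

print(encoder_shapes((32, 32, 3)))  # [(32, 32, 3), (16, 16, 16), (8, 8, 32), (4, 4, 64)]
print(encoder_shapes((64, 64, 3)))  # [(64, 64, 3), (32, 32, 16), (16, 16, 32), (8, 8, 64)]
```

If the height or width is not divisible by 8, the rounded-up encoder sizes no longer match what three stride-2 transposed convolutions produce, so the reconstruction would not have the input shape.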

In [None]:
# Convolutional Autoencoder V1 (No dense latent vector, latent space is a feature map)
class ConvAutoencoderV1(Model):
  def __init__(self, image_shape):
    super(ConvAutoencoderV1, self).__init__()
    self.image_shape = image_shape
    # Encoder (for the default 32x32 input): (32,32,3) -> (16,16,16) -> (8,8,32) -> (4,4,64)
    self.encoder = tf.keras.Sequential([
      Input(shape=self.image_shape),
      Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
      Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
      Conv2D(64, (3, 3), activation='relu', padding='same', strides=2),
    ])

    # Decoder (for the default 32x32 input): (4,4,64) -> (8,8,32) -> (16,16,16) -> (32,32,3)
    self.decoder = tf.keras.Sequential([
      Conv2DTranspose(32, kernel_size=3, strides=2, activation='relu', padding='same'),
      Conv2DTranspose(16, kernel_size=3, strides=2, activation='relu', padding='same'),
      Conv2DTranspose(self.image_shape[2], kernel_size=3, strides=2, activation='sigmoid', padding='same') # Output channels = 3
    ])

  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded

autoencoder_conv_v1 = ConvAutoencoderV1(IMAGE_SHAPE)
autoencoder_conv_v1.compile(optimizer='adam', loss=losses.MeanSquaredError())

autoencoder_conv_v1(tf.ones((BATCH_SIZE,) + IMAGE_SHAPE))
autoencoder_conv_v1.summary()

In [None]:
# Convolutional Autoencoder V2 (with a dense latent vector)
class ConvAutoencoderV2(Model):
  def __init__(self, latent_dim, image_shape):
    super(ConvAutoencoderV2, self).__init__()
    self.latent_dim = latent_dim
    self.image_shape = image_shape

    # Calculate shape before Flatten in encoder (shown for the default 32x32 input)
    # Input: (32,32,3)
    # Conv1 (strides 2): (16,16,16)
    # Conv2 (strides 2): (8,8,32)
    # Conv3 (strides 2): (4,4,64) <- This is self.shape_before_flatten
    self.shape_before_flatten = (image_shape[0]//8, image_shape[1]//8, 64)

    self.encoder = tf.keras.Sequential([
      Input(shape=self.image_shape),
      Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
      Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
      Conv2D(64, (3, 3), activation='relu', padding='same', strides=2),
      Flatten(),
      Dense(self.latent_dim, activation='relu')
    ])

    self.decoder = tf.keras.Sequential([
      Input(shape=(self.latent_dim,)),
      Dense(np.prod(self.shape_before_flatten), activation='relu'),
      Reshape(self.shape_before_flatten),
      Conv2DTranspose(32, kernel_size=3, strides=2, activation='relu', padding='same'),
      Conv2DTranspose(16, kernel_size=3, strides=2, activation='relu', padding='same'),
      Conv2DTranspose(self.image_shape[2], kernel_size=3, strides=2, activation='sigmoid', padding='same')
    ])

  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded

latent_dim_conv = 128 # Example latent dimension for ConvAE V2
autoencoder_conv_v2 = ConvAutoencoderV2(latent_dim_conv, IMAGE_SHAPE)
autoencoder_conv_v2.compile(optimizer='adam', loss=losses.MeanSquaredError())

autoencoder_conv_v2(tf.ones((BATCH_SIZE,) + IMAGE_SHAPE))
autoencoder_conv_v2.summary()

### Your Task (Autoencoders):
1.  **Train the Autoencoders:** Train `ConvAutoencoderV1` and `ConvAutoencoderV2` (and optionally the MLP one further) on the CelebA dataset. Monitor training using TensorBoard.
2.  **Experiment with `latent_dim`:** For `AutoencoderMLP` and `ConvAutoencoderV2`, vary the `latent_dim`. How does this affect reconstruction quality and the potential compression ratio?
3.  **Analyze Network Architecture:** How do choices in the number of layers, filter sizes, or activation functions impact performance?
4.  **Evaluate:** Compare the reconstructed images with the originals (visually and using MSE). How does the compression achieved by autoencoders compare to the frequency-domain methods and to standard image formats (e.g., JPEG at different quality levels, if you want to go further)?
5.  **Size Reduction:** The size of the latent representation (`latent_dim` for MLP/ConvV2, or the dimensions of the feature map for ConvV1) directly relates to the compressed size. Discuss how you would store these latent vectors/tensors and how that compares to the original image size in bits.
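
For the size-reduction discussion, a rough estimate can be sketched as follows. It assumes 8 bits per original pixel value and latents stored as uncompressed `float32` (32 bits each); both are assumptions, and quantizing the latents would change the numbers. The helper name `compression_ratio` is hypothetical. The estimate also ignores the decoder weights, which are shared across all images.

```python
import math

def compression_ratio(image_shape, latent_values, bits_per_pixel_value=8,
                      bits_per_latent_value=32):
    """Original size in bits divided by latent size in bits (>1 means compression)."""
    original_bits = math.prod(image_shape) * bits_per_pixel_value
    latent_bits = latent_values * bits_per_latent_value
    return original_bits / latent_bits

image_shape = (32, 32, 3)  # default IMAGE_SHAPE in this notebook

print("MLP, latent_dim=32:       ", compression_ratio(image_shape, 32))
print("ConvV2, latent_dim=128:   ", compression_ratio(image_shape, 128))
print("ConvV1, 4x4x64 feature map:", compression_ratio(image_shape, 4 * 4 * 64))
```

Under these assumptions the ratios are 24, 6, and 0.75, so the `ConvAutoencoderV1` feature map stored as `float32` is larger than the 8-bit input image.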

# Option 3: Implicit Neural Representations (INRs)

Implicit Neural Representations (INRs), sometimes called coordinate-based neural networks, offer a different paradigm for representing signals like images. Instead of learning a discrete grid of pixels, an INR learns a continuous function `f(coordinate) -> value` (e.g., `f(x,y) -> (R,G,B)`).

The network itself (its weights) becomes the compressed representation of the image. To reconstruct the image, you query the network with all pixel coordinates.

**Key characteristics:**
*   **Continuous Representation:** Can be sampled at arbitrary resolutions.
*   **Compression:** The size of the network parameters determines the compressed size.
*   **Positional Encoding:** Often crucial for INRs to learn high-frequency details.
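
To make the positional-encoding idea concrete, here is a minimal NumPy sketch of one common choice, the sine/cosine encoding with frequencies `2^k * pi` popularized by NeRF. The linked tutorial may use a different encoding; the function names and the number of frequencies are assumptions for illustration.

```python
import numpy as np

def pixel_grid(height, width):
    """All pixel coordinates as (x, y) pairs, normalized to [-1, 1]."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)  # shape (H*W, 2)

def positional_encoding(coords, num_frequencies):
    """Map each coordinate p to sin(2^k * pi * p) and cos(2^k * pi * p), k = 0..L-1."""
    freqs = (2.0 ** np.arange(num_frequencies)) * np.pi              # shape (L,)
    angles = coords[:, :, None] * freqs                              # shape (N, 2, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # shape (N, 2, 2L)
    return enc.reshape(coords.shape[0], -1)                          # shape (N, 4L)

coords = pixel_grid(32, 32)
features = positional_encoding(coords, num_frequencies=6)
print(coords.shape, features.shape)  # (1024, 2) (1024, 24)
```

The encoded `features` would replace the raw `(x, y)` coordinates as network input; the network output for each row is the predicted RGB value of that pixel.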

For starter code and a more detailed explanation, refer to tutorials on INRs/IMLPs, such as:
*   [Implicit Neural Representations with PyTorch Tutorial (NeRF-related)](https://colab.research.google.com/github/Digital-Media/vco/blob/main/11_IMLP.ipynb) 

### Your Task (Implicit Neural Representations - Exploration):
1.  **Understand the Concept:** Review the principles of INRs for image representation.
2.  **Experiment (Optional but Encouraged):** If time permits, try to fit an INR (e.g., a simple MLP with positional encoding) to a single CelebA image.
    *   How does network size (layers, neurons) affect the quality of the fit and the compression (model size)?
    *   How many training epochs are needed?
    *   How does positional encoding influence the result?
3.  **Compare:** Conceptually, how does compression with INRs compare to autoencoders and frequency-domain methods in terms of:
    *   how the compressed data is stored,
    *   decoding speed, and
    *   the ability to represent details?
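
Because the network weights are the compressed representation, the model size can be estimated before any training. The sketch below counts the parameters of a fully connected network and compares them with the raw image size. The layer sizes (24 encoded inputs, two hidden layers of 64 units, 3 outputs) and `float32` weight storage are assumptions chosen for illustration.

```python
import math

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def inr_compression_ratio(image_shape, layer_sizes, bits_per_weight=32):
    """Raw 8-bit image size divided by the size of the stored network weights."""
    image_bits = math.prod(image_shape) * 8
    model_bits = mlp_param_count(layer_sizes) * bits_per_weight
    return image_bits / model_bits

layers = [24, 64, 64, 3]  # hypothetical INR architecture
print("Parameters:", mlp_param_count(layers))
print("Ratio for 32x32x3:  ", round(inr_compression_ratio((32, 32, 3), layers), 3))
print("Ratio for 128x128x3:", round(inr_compression_ratio((128, 128, 3), layers), 3))
```

With these assumptions the network is larger than a 32x32 image (ratio below 1) but smaller than a 128x128 image, which suggests that INRs only pay off for larger images or much smaller networks.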

# Final Report and Comparison

Summarize your findings from all parts of this exercise in a report.
*   Compare the different compression techniques (DFT/DCT, MLP Autoencoder, Convolutional Autoencoders, and conceptually INRs).
*   Discuss their strengths and weaknesses regarding compression ratio, reconstruction quality, computational cost (training/inference), and parameter tuning.
*   What are the trade-offs involved in choosing a particular method or its parameters?
*   How do these methods compare to your results from previous homeworks (if applicable, e.g., HW01's frequency compression on CIFAR-10 vs. CelebA)?