# Vector Based Rotation Invariant Convolutional Neural Networks

## Team Members
- Nikolas Anagostou: Team Member
- Daniel Gove: Team Member
- Zachary Varnum: Team Leader

## Introduction

In traditional convolutional neural networks (CNNs), the filters applied during the convolution process are sensitive to the orientation of features within the input data. This characteristic can reduce the model's effectiveness in applications where orientation is variable and not indicative of class distinctions. To address this limitation, our project proposes a novel convolutional architecture that utilizes the magnitude and phase of vectors in 2D space. These vectors represent both the convolutional response and the angle of rotation of the filter that produced that response, aiming to create a rotation-invariant CNN. This approach is expected to enhance model robustness by decoupling feature recognition from specific orientations.


## Proposed Method

### Concept Overview

Our architecture pursues rotation invariance in convolutional neural networks through vector representations of the convolutional filters and responses. By using both Cartesian and polar forms of vectors, our method leverages the properties of each coordinate system (vector addition in Cartesian form, rotation and scaling in polar form) to handle orientation in image data.


### Vector Representations

#### Cartesian and Polar Coordinates

Vectors can be represented in two primary forms:

- **Cartesian Coordinates** (x, y): Ideal for operations such as vector addition. For example, to add two vectors $(x_1, y_1)$ and $(x_2, y_2)$, the resulting vector is $(x_1+x_2, y_1+y_2)$.
  
  ![Diagram of Vector Addition](URL_to_diagram_vector_addition)

- **Polar Coordinates** (r, θ): More suited for operations involving rotations and scaling. A vector in polar form $(r, \theta)$ can be rotated by an angle $\phi$ and scaled by a factor $a$, resulting in $(ar, \theta + \phi)$.

  ![Diagram of Vector Rotation and Scaling](URL_to_diagram_vector_rotation_scaling)

#### Conversion between Cartesian and Polar

Conversion between these coordinate systems is governed by the following formulas:
- To Polar: $r = \sqrt{x^2 + y^2}, \quad \theta = \operatorname{atan2}(y, x)$, where $\operatorname{atan2}$ is the quadrant-aware form of $\tan^{-1}\left(\frac{y}{x}\right)$
- To Cartesian: $x = r \cos(\theta), \quad y = r \sin(\theta)$
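
The two conversions can be sketched in NumPy as follows. The function names `to_polar` and `to_cartesian` are hypothetical helpers chosen for illustration. The sketch uses `np.arctan2`, which resolves the quadrant from the signs of $x$ and $y$.

```python
import numpy as np

def to_polar(x, y):
    """Cartesian (x, y) -> polar (r, theta), with theta in (-pi, pi]."""
    r = np.hypot(x, y)          # sqrt(x^2 + y^2)
    theta = np.arctan2(y, x)    # quadrant-aware inverse tangent
    return r, theta

def to_cartesian(r, theta):
    """Polar (r, theta) -> Cartesian (x, y)."""
    return r * np.cos(theta), r * np.sin(theta)

# Round trip for a vector in the second quadrant.
r, theta = to_polar(-1.0, 1.0)
x, y = to_cartesian(r, theta)
```

A plain `np.arctan(y / x)` would place this second-quadrant vector in the fourth quadrant, which is why the quadrant-aware form is used in the sketch.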

### Vector Convolution Operation

#### Theoretical Formulation

Let $F$ represent a convolutional filter intended for application on input data. For a given rotation angle $\theta$, the filter is transformed by the rotation matrix $R_{\theta}$, yielding the rotated filter $F \cdot R_{\theta}$. The convolutional response to an input $x$ under this transformation is defined as:
$$
f(x, \theta) = (F \cdot R_{\theta}) \ast x,
$$
where $\ast$ signifies the convolution operation.

To encapsulate the orientation dynamics within the convolutional framework, we introduce a vector convolution operation, formalized as follows:
$$
v(x, \theta) = f(x, \theta) \cdot e^{i\theta},
$$
which maps the convolutional output to a vector in polar coordinates $(a, \theta)$. Here, $a$ denotes the magnitude of the response, and $\theta$ indicates the orientation of the applied filter. The factor $e^{i\theta}$, a unit complex number, serves to rotate the output vector by $\theta$ radians, thereby aligning the response's orientation with that of the filter's application. This formulation leverages the rotational properties of complex numbers to seamlessly integrate orientation information into the network's output.
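
A minimal sketch of this operation, under simplifying assumptions that are not part of the proposal: the filter is rotated only by multiples of 90 degrees (via `np.rot90`, so no interpolation is needed), the convolution is implemented as cross-correlation with valid padding, and the helper names `conv2d_valid` and `vector_conv` are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D cross-correlation with 'valid' padding (assumed convention)."""
    kh, kw = kernel.shape
    windows = np.lib.stride_tricks.sliding_window_view(image, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def vector_conv(image, kernel, k):
    """v(x, theta) = f(x, theta) * exp(i * theta) for theta = k * 90 degrees."""
    theta = k * np.pi / 2
    f = conv2d_valid(image, np.rot90(kernel, k))  # response of rotated filter
    return f * np.exp(1j * theta)                 # tag response with the angle

# A vertical edge and a horizontal-gradient filter.
image = np.zeros((4, 4))
image[:, 2:] = 1.0
kernel = np.array([[-1.0, 1.0]])

v0 = vector_conv(image, kernel, 0)  # filter at 0 degrees
v1 = vector_conv(image, kernel, 1)  # filter at 90 degrees
v2 = vector_conv(image, kernel, 2)  # filter at 180 degrees
```

Complex numbers are used here only as a convenient container for the 2D vector: `np.abs(v)` gives the magnitude and `np.angle(v)` the phase.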

#### Handling Negative Responses

One inherent challenge in this vector convolution framework is that convolutional responses can be negative, whereas magnitudes in polar coordinates are inherently non-negative. To resolve this, we reinterpret a negative response $(-a, \theta)$ as $(a, \theta + \pi)$, effectively treating the angle $\theta$ as the orientation modulo $\pi$. This convention, while allowing for representation of negative amplitudes, introduces ambiguity in angle $\theta$, as it no longer uniquely identifies the filter orientation.
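
The reinterpretation can be checked numerically. The values of `a` and `theta` below are arbitrary illustrative choices.

```python
import numpy as np

a, theta = 2.0, np.pi / 6

# A negative response -a produced by a filter at angle theta ...
v_negative = -a * np.exp(1j * theta)
# ... is the same vector as a positive response a at angle theta + pi.
v_shifted = a * np.exp(1j * (theta + np.pi))

magnitude = np.abs(v_negative)   # non-negative magnitude
phase = np.angle(v_negative)     # wrapped to (-pi, pi]
```

Because the two vectors coincide, the phase alone cannot tell whether the filter was at $\theta$ with a negative response or at $\theta + \pi$ with a positive one, which is the ambiguity described above.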


### Iterative Rotation for Orientation Variance

To robustly capture orientation variances, our model systematically applies $N$ discrete rotations to the convolutional filter $F$. These rotations are encapsulated within a tensor $R$, where each slice $R_n$ represents a rotation matrix for a specific angle $\theta_n$. The process can be mathematically described as follows:

$$
F_n = F \cdot R_n \quad \text{for} \quad n = 1, 2, \dots, N
$$

where $F_n$ represents the rotated filter corresponding to the $n$-th orientation. This sequence of operations ensures a comprehensive exploration of potential orientations, which is critical for achieving rotation invariance in the network's processing capabilities.

This iterative application of rotations is encapsulated within:
$$
\mathcal{R}(F) = \{F \cdot R_1, F \cdot R_2, \dots, F \cdot R_N\}
$$

Here, $\mathcal{R}(F)$ denotes the set of all rotated versions of the filter $F$, each transformed by a rotation matrix $R_n$ corresponding to an angle $\theta_n$. This set of rotated filters forms the foundation for the rotation-invariant properties of our network.

By iterating over these orientations, the network is intended to respond consistently regardless of how input features are oriented, up to the angular resolution set by $N$. This capability is particularly valuable in applications where feature orientation varies significantly and would otherwise degrade the accuracy of the convolutional analysis.
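
As a sketch of building $\mathcal{R}(F)$, the example below assumes $N = 4$ and a square filter, so that every rotation is an exact multiple of 90 degrees. Arbitrary angles would need interpolation, which is outside this sketch. The name `rotated_bank` is hypothetical.

```python
import numpy as np

def rotated_bank(kernel):
    """Return the stack [F_1, ..., F_4] and the angles theta_n (radians).

    Assumes a square kernel and N = 4 rotations of 0, 90, 180, 270 degrees.
    """
    angles = np.arange(4) * np.pi / 2
    bank = np.stack([np.rot90(kernel, k) for k in range(4)])
    return bank, angles

kernel = np.arange(9.0).reshape(3, 3)
bank, angles = rotated_bank(kernel)   # bank has shape [N, 3, 3]
```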





### Depthwise Magnitude Max Pooling

Following the iterative rotation convolution step, our architecture incorporates a depthwise magnitude max pooling operation. This process can be mathematically represented as follows:

Let $V$ be the tensor output by the vector convolution operation across multiple orientations, where $V$ has dimensions $[H, W, N, C]$:
- $H$ and $W$ are the height and width of the feature map,
- $N$ is the number of different orientations,
- $C$ is the number of channels (filters).

The depthwise magnitude max pooling operates on the magnitude of the responses from different orientations for each spatial location $(i, j)$ and each channel $c$. Mathematically, it is defined as:
$$
M_{ijc} = \max_{n=1}^N |V_{ijnc}|,
$$
where $|V_{ijnc}|$ denotes the magnitude of the vector response at position $(i, j)$, for orientation $n$, and channel $c$. The operation selects the maximum magnitude across all $N$ orientations and records the orientation $\theta_{n^*}$ of the filter that produced it:
$$
\Theta_{ijc} = \theta_{n^*}, \quad \text{where} \quad n^* = \operatorname{argmax}_{n \in \{1, \dots, N\}} |V_{ijnc}|.
$$

This pooling method effectively identifies and preserves the most prominent orientation for each filter at each spatial location, optimizing the filter's detection capabilities in the processed input. The output tensor $M$ with elements $M_{ijc}$ represents the pooled feature map, and $\Theta$ with elements $\Theta_{ijc}$ captures the optimal orientations.
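
The pooling step can be sketched as below, assuming $V$ is stored as a complex array of shape $[H, W, N, C]$ and `angles` holds the $N$ filter angles $\theta_n$. The function name `magnitude_max_pool` is hypothetical.

```python
import numpy as np

def magnitude_max_pool(V, angles):
    """Pool over the orientation axis of V ([H, W, N, C], complex).

    Returns M ([H, W, C]) with the largest magnitude per location/channel
    and Theta ([H, W, C]) with the filter angle that produced it.
    """
    mags = np.abs(V)
    best = np.argmax(mags, axis=2)   # index of the winning orientation
    M = mags.max(axis=2)
    Theta = angles[best]             # look up theta_n for each winner
    return M, Theta

angles = np.arange(4) * np.pi / 2
V = np.zeros((2, 2, 4, 1), dtype=complex)
V[0, 0, 1, 0] = 1.0     # weak response at 90 degrees
V[0, 0, 2, 0] = -3.0    # strong (negative) response at 180 degrees
M, Theta = magnitude_max_pool(V, angles)
```

When several orientations tie, `np.argmax` returns the first one, which is an arbitrary choice made by this sketch.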

### Vector Feature Map Integration and Transformation

#### Applying Convolution Weights in Polar Form

Given that the vector feature maps are stored in polar form $(r, \theta)$, using polar coordinates for applying convolutional weights offers clear advantages for rotation-sensitive tasks:
- **Simplified Rotation Handling**: Rotations are managed by simply adjusting the angle component $\theta$, which is inherently simpler and more efficient than the Cartesian coordinate transformations.
- **Direct Manipulation of Magnitude and Phase**: Polar coordinates allow direct scaling of magnitudes and straightforward rotational shifts by modifying $\theta$, aligning features based on their orientation.

The convolution weights are also in polar form $(a_i, \theta_i)$, and multiplication of two polar coordinates is performed as:
$$
r' = r \cdot a_i, \quad \theta' = \theta + \theta_i
$$
This results in a new vector $(r', \theta')$, effectively scaling the magnitude and rotating the phase of each input vector. This operation is well suited to tasks where the alignment and scale relative to different orientations are crucial.
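
A small worked example of the polar weight application, with hypothetical names and arbitrary values. It also checks that the rule matches ordinary complex multiplication.

```python
import numpy as np

def apply_polar_weight(r, theta, a, theta_w):
    """Scale the magnitude by a and rotate the phase by theta_w."""
    return r * a, theta + theta_w

# Input vector (2, 45 degrees), weight (0.5, 45 degrees).
r_new, theta_new = apply_polar_weight(2.0, np.pi / 4, 0.5, np.pi / 4)

# The same result via complex multiplication.
as_complex = (2.0 * np.exp(1j * np.pi / 4)) * (0.5 * np.exp(1j * np.pi / 4))
```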

#### Summation and Feature Interaction in Cartesian Coordinates

After applying the weights, the vectors, now transformed to $(r', \theta')$, are converted to Cartesian coordinates to facilitate summation:
$$
x = r' \cos(\theta'), \quad y = r' \sin(\theta')
$$
The sum of these Cartesian components across all applied filters is computed for each feature map location:
$$
X = \sum_{i=1}^N x_i, \quad Y = \sum_{i=1}^N y_i
$$
where $N$ is the number of filters or transformations applied. This summation allows for the aggregation of features from various orientations, enhancing coherent (aligned) features and reducing incoherent (misaligned) ones.

After summing, the average is typically taken to normalize the results, especially in deep convolutional layers where managing feature scale is critical:
$$
X_{avg} = \frac{X}{N}, \quad Y_{avg} = \frac{Y}{N}
$$
This averaging process helps in stabilizing the learning by reducing the variance of the output values, making the model less sensitive to the specific number of filters used.
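
The effect of summing in Cartesian form can be illustrated at a single feature map location. The helper `aggregate` is a hypothetical name, and the inputs are the weighted vectors $(r'_i, \theta'_i)$ for that location.

```python
import numpy as np

def aggregate(r, theta):
    """Average N polar vectors by summing their Cartesian components."""
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    return x.mean(), y.mean()   # (X_avg, Y_avg)

# Two aligned vectors reinforce each other ...
aligned = aggregate(np.array([1.0, 1.0]), np.array([0.0, 0.0]))
# ... while two opposed vectors cancel.
opposed = aggregate(np.array([1.0, 1.0]), np.array([0.0, np.pi]))
```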

#### Reconversion to Polar and Introduction of Non-Linearity

The summed and averaged Cartesian components $(X_{avg}, Y_{avg})$ are then converted back into polar coordinates to prepare for non-linear activation:
$$
R = \sqrt{X_{avg}^2 + Y_{avg}^2}, \quad \Phi = \operatorname{atan2}(Y_{avg}, X_{avg})
$$
A polar ReLU function is applied to introduce non-linearity, crucial for capturing non-linear relationships within the data:
$$
R' = \begin{cases}
R & \text{if } R \geq 1 \\
0 & \text{otherwise}
\end{cases}
$$
This thresholding zeroes any magnitude below 1, so only sufficiently strong responses contribute to the network's later layers.
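
A sketch of the reconversion and polar ReLU follows. Two points are assumptions of this sketch rather than statements from the method: the phase $\Phi$ is passed through unchanged, and the threshold is exposed as a parameter that defaults to 1. The name `polar_relu` is hypothetical.

```python
import numpy as np

def polar_relu(x_avg, y_avg, threshold=1.0):
    """Convert to polar form and zero magnitudes below the threshold."""
    R = np.hypot(x_avg, y_avg)
    Phi = np.arctan2(y_avg, x_avg)
    R_out = np.where(R >= threshold, R, 0.0)
    return R_out, Phi   # assumption: the phase is left untouched

# Magnitudes 5.0 and 0.5: the first passes, the second is zeroed.
R_out, Phi = polar_relu(np.array([3.0, 0.3]), np.array([4.0, 0.4]))
```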

### Further Processing and Integration into Traditional Architectures

#### Transition to Standard Convolutional Layers

Following the specialized operations of applying convolutional weights in polar form and converting between coordinate systems, the feature maps are processed using standard convolutional layers. These layers are defined mathematically by:
$$
F_{out} = \sigma\left(W * F_{in} + b\right)
$$
where $*$ denotes the convolution operation, $W$ represents the convolutional kernels, $b$ is the bias, $F_{in}$ is the input feature map from the previous polar operations, $\sigma$ is the activation function (e.g., ReLU), and $F_{out}$ is the output feature map.

#### Integration of Pooling Layers

To reduce spatial dimensions and enhance feature robustness against small variations and noise, pooling layers are integrated:
$$
P_{out} = \text{pool}(F_{out})
$$
Here, $\text{pool}$ can be a max pooling operation, in which the maximum value within a specified window of offsets $\Omega$ is selected:
$$
P_{out}(i, j) = \max_{(a, b) \in \Omega} F_{out}(i+a, j+b)
$$
This operation reduces the size of each feature map while preserving the most prominent features, enhancing translational invariance.
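
A compact sketch of max pooling on a single 2D feature map, assuming non-overlapping square windows (stride equal to the window size). The name `max_pool2d` is hypothetical.

```python
import numpy as np

def max_pool2d(F, size=2):
    """Max pooling with non-overlapping size x size windows."""
    H, W = F.shape
    H2, W2 = H // size, W // size
    trimmed = F[:H2 * size, :W2 * size]   # drop rows/cols that do not fit
    return trimmed.reshape(H2, size, W2, size).max(axis=(1, 3))

F_out = np.arange(16.0).reshape(4, 4)
P_out = max_pool2d(F_out)   # 4x4 map reduced to 2x2
```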

#### Applying Magnitude Max Pooling for Downsampling

Additionally, to focus on the most relevant features in terms of magnitude, especially after handling complex transformations, magnitude max pooling is applied across orientations:
$$
M_{out} = \max_{\theta \in \Theta}( |F_{\theta}| )
$$
where $|F_{\theta}|$ represents the magnitude of the vector feature at each point, considering different orientations $\Theta$. This magnitude max pooling helps in downsampling the feature maps by selecting the dominant features across different rotations, thus reinforcing the network's rotation invariance.

#### Normalization and Final Layers

Before proceeding to the output layers, normalization techniques such as Batch Normalization can be applied:
$$
F_{norm} = \gamma \left(\frac{F_{out} - \mu_{B}}{\sqrt{\sigma^2_{B} + \epsilon}}\right) + \beta
$$
where $\mu_B$ and $\sigma^2_B$ are the mean and variance of the batch, $\epsilon$ is a small constant to prevent division by zero, and $\gamma$, $\beta$ are parameters learned during training to scale and shift the normalized data.
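
The normalization formula can be sketched for the training-time case, assuming the batch is the leading axis and statistics are computed per feature. Running averages for inference are omitted, and the name `batch_norm` is hypothetical.

```python
import numpy as np

def batch_norm(F, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mu = F.mean(axis=0)
    var = F.var(axis=0)
    return gamma * (F - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
F_out = rng.normal(loc=3.0, scale=2.0, size=(8, 3))   # batch of 8, 3 features
F_norm = batch_norm(F_out, gamma=1.0, beta=0.0)
```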

#### Output Layer and Classification

Finally, the processed feature maps can be flattened and fed into a dense layer for classification, or further convolutional layers depending on the specific task:
$$
\text{Output} = \text{Softmax}(W_{final} \cdot \text{Flatten}(F_{norm}) + b_{final})
$$
This structure allows the integration of the specialized vector-based convolutional features with standard layers of a CNN, ensuring that the network is suitable for complex tasks requiring understanding of rotation-invariant features.
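
A minimal sketch of the classification head, with randomly initialized weights standing in for trained ones. The names `softmax` and `classify` and the shapes used are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(F_norm, W_final, b_final):
    """Softmax(W_final . Flatten(F_norm) + b_final)."""
    return softmax(W_final @ F_norm.ravel() + b_final)

rng = np.random.default_rng(0)
F_norm = rng.normal(size=(2, 2, 3))    # small [H, W, C] feature map
W_final = rng.normal(size=(4, 12))     # 4 classes, 12 flattened features
b_final = np.zeros(4)
probs = classify(F_norm, W_final, b_final)
```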

### Conclusion

This integration strategy is designed to combine the novel polar-based convolution operations with traditional CNN architectures, with the aim of effective learning and generalization across varied visual tasks. Including both standard and novel pooling methods is intended to help the network handle spatial and rotational variation in input data.
