What is a fully connected network (FCN for short)?

A fully connected network is a type of artificial neural network whose structure consists of fully connected layers. In a fully connected layer, every neuron/node is connected to every neuron/node in the next layer. For example, suppose an FCN has 3 layers, with layer 1 having 10 neurons, layer 2 having 26 neurons, and layer 3 having 14 neurons. Each neuron in layer 1 has a connection to every single neuron in layer 2, and the same holds between layer 2 and layer 3. What is also interesting is that you can figure out the number of connections between two layers. The equation for that is N1*N2, where N1 is the number of neurons in the layer on the left and N2 is the number of neurons in the layer on the right. In the example, that gives 10*26 = 260 connections between layers 1 and 2, and 26*14 = 364 connections between layers 2 and 3.
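
The connection count can be checked with a few lines of Python. This is only a small sketch using the layer sizes from the example; the helper name `count_connections` is my own and does not come from any library.

```python
def count_connections(layer_sizes):
    """Return the number of connections between each pair of adjacent layers."""
    return [n1 * n2 for n1, n2 in zip(layer_sizes, layer_sizes[1:])]


layer_sizes = [10, 26, 14]  # neurons in layer 1, layer 2, and layer 3
connections = count_connections(layer_sizes)
print(connections)       # [260, 364]
print(sum(connections))  # 624
```

Note that this counts weights only. Each neuron in layers 2 and 3 would normally also have one bias value.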


How do Fully Connected Networks work?

FCNs work by having each neuron/perceptron apply a linear transformation to the input vector through a weight matrix. A non-linear transformation is then applied to the result through a non-linear activation function (show equation image).

Basically, we take the product of the input vector x and the weight matrix W. The bias term W0 is then added inside the non-linear function. (A bias term is a learnable constant added to the weighted sum, which lets a neuron shift its output independently of the input.) In even simpler terms, we are doing vector-matrix multiplication. For example, say we have an input vector of size 1x9 and a weight matrix of size 9x4. We multiply the (1x9) vector by the (9x4) matrix, add the bias, and then apply the non-linear activation function f to get an output vector of size (1x4) (show the second image before the example and the third fcl image after the example).
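
The 1x9 by 9x4 example can be written out in NumPy. This is a minimal sketch, not the only way to do it: the random values, the zero bias, and the choice of ReLU for f are all assumptions made just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 9))   # input vector, shape (1, 9)
W = rng.normal(size=(9, 4))   # weight matrix, shape (9, 4)
w0 = np.zeros((1, 4))         # bias term, one value per output neuron


def f(z):
    """ReLU activation, chosen here only as an example of f."""
    return np.maximum(z, 0.0)


y = f(x @ W + w0)             # linear step first, then the activation
print(y.shape)                # (1, 4)
```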

An activation function is a function that defines the output of a neuron/node given an input or set of inputs.

Some of the most commonly used activation functions are:
Binary step,
Linear,
Sigmoid,
Tanh,
ReLU,
Leaky ReLU,
Parametric ReLU,
Exponential Linear Unit,
ReLU-6,
Softplus,
Softsign, 
Softmax, 
and Swish
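
To make a few of these concrete, here is a rough NumPy sketch of four of them. The function names and the leaky slope of 0.01 are my own choices for illustration; deep learning libraries ship their own versions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def leaky_relu(z, slope=0.01):
    # Negative inputs are scaled down instead of being set to zero.
    return np.where(z > 0, z, slope * z)


def softmax(z):
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()    # outputs are positive and sum to 1


z = np.array([-2.0, 0.0, 3.0])
print(relu(z))
print(leaky_relu(z))
print(np.tanh(z))
print(sigmoid(z))
print(softmax(z))
```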


How does an FCN differ from a CNN?

The biggest difference, from what I have seen in my own research online, is that FCNs are structurally agnostic, meaning that they don't make any special assumptions about the input, whereas a CNN is designed on the assumption that the inputs are images.

This lack of assumptions can be quite useful if one wants to train on different kinds of data. However, because of it, the performance of an FCN is often not as good as that of a neural network designed for a specific kind of input, like a CNN. Another advantage is that FCNs have at least as much expressive power as CNNs: a convolution is a linear operation that a fully connected layer can also represent, since it amounts to a fully connected layer with shared and mostly zero weights.

The specific focus of a CNN is quite useful, as it can process image inputs quite quickly compared to an FCN. However, the main disadvantage of this type of neural network is that it is tailored to images and is a poor fit for data without that kind of spatial structure, which is where the FCN comes in. Another advantage is that CNNs tend to use their parameters more efficiently. FCNs tend to require a greater number of parameters to compete with an equivalent CNN.
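
The parameter gap can be seen with simple arithmetic. The sketch below compares one convolutional layer with one fully connected layer that produces the same number of outputs. The layer shape (28x28 grayscale input, 8 output feature maps, 3x3 kernel) is a made-up example, not something taken from a specific network.

```python
# Hypothetical layer: 28x28 grayscale input -> 8 feature maps of 28x28.
height, width, in_channels, out_channels = 28, 28, 1, 8
kernel = 3  # 3x3 convolution kernel (an assumed, typical size)

# Convolutional layer: one small kernel per (input, output) channel pair,
# shared across every pixel position, plus one bias per output channel.
conv_params = kernel * kernel * in_channels * out_channels + out_channels

# Fully connected layer with the same number of outputs: every input
# value connects to every output value, plus one bias per output.
n_inputs = height * width * in_channels
n_outputs = height * width * out_channels
fc_params = n_inputs * n_outputs + n_outputs

print(conv_params)  # 80
print(fc_params)    # 4923520
```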

So, overall, depending on your needs, an FCN can be better than a CNN and vice versa.

What are some ways that FCNs can be used?

While searching for good real-world examples of FCNs as a way to better explain them, I came across three different papers that piqued my interest.

The first one is called Intra Prediction using Fully Connected Network for Video Coding, by Jihao Li, Bin Li, Jizheng Xu, and Ruiqin Xiong.

The second one is called Fully Connected Network on Noncompact Symmetric Space and Ridgelet Transform based on Helgason Fourier Analysis, by Sho Sonoda, Isao Ishikawa, and Masahiro Ikeda.

The third one is called How Far Can We go Without Convolution: Improving Fully Connected Networks, by Zhouhan Lin, Roland Memisevic, and Kishore Konda.

I will go over the three real-world examples after my example code.

The goal of my example is to train my FCN on the MNIST dataset, which has 60000 training images, and then test it on the 10000 MNIST test images.

In my example I create an FCN using the nn.Module class. It consists of two fully connected layers with a ReLU activation in between.

I also define the forward pass function, which computes the output of the model.

During training, the code iterates over the training dataset in batches and performs a forward and a backward pass on each batch.
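
Before the code itself, here is a quick back-of-the-envelope count of how many trainable values the two-layer model has. The sizes are the ones used in the hyperparameters of this example (784 inputs, 500 hidden units, 10 classes); the variable names `fc1_params` and `fc2_params` are just for this sketch.

```python
input_size, hidden_size, num_classes = 784, 500, 10

fc1_params = input_size * hidden_size + hidden_size    # weights + biases
fc2_params = hidden_size * num_classes + num_classes   # weights + biases

print(fc1_params)               # 392500
print(fc2_params)               # 5010
print(fc1_params + fc2_params)  # 397510
```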

In [None]:
import torch
import torchvision
import torch.nn as nn
import torch.optim as optim

In [None]:
# Define the fully connected neural network model
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

In [None]:

# Set the device for training
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# Define hyperparameters
input_size = 784  # Input size of MNIST dataset (28x28 pixels)
hidden_size = 500
num_classes = 10
learning_rate = 0.001
batch_size = 100
num_epochs = 5

In [None]:
# Load the MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=torchvision.transforms.ToTensor(), download=True)
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=torchvision.transforms.ToTensor())

# Create data loaders
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

In [None]:

# Initialize the neural network
model = NeuralNetwork(input_size, hidden_size, num_classes).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [None]:
# Train the neural network
total_steps = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Reshape images to (batch_size, input_size)
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_steps}], Loss: {loss.item():.4f}')

In [None]:
# Test the neural network
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, input_size).to(device)
        labels = labels.to(device)
        outputs = model(images)
        # The index of the highest score is the predicted digit
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print(f'Accuracy of the network on the 10000 test images: {(100 * correct / total):.2f}%')

A Brief Summary of Intra Prediction using Fully Connected Network for Video Coding, by Jihao Li, Bin Li, Jizheng Xu, and Ruiqin Xiong:

Basically, what the authors wanted to achieve was to apply deep neural networks, in the form of FCNs, to improve on state-of-the-art (SOTA) intra prediction. The framework they considered is block-based, and they proposed using an FCN in which all layers except the non-linear layers are fully connected.

Compared to the traditional method, the method the authors suggested exploits a richer context around the current block.

The SOTA video coding standard that the paper compares against is HEVC, or High Efficiency Video Coding. The authors were able to improve on it by 1.1%.

For 4K video, they were able to improve on it by 1.6%.

In the end, the authors found that intra prediction for video coding can be improved by using their proposed version of a fully connected network, which they call IPFCN. They found that a 128-dimensional IPFCN with 3 layers worked best. They also note that the training set contained 48 sets and that the block size was 8x8, and that they hope to investigate this further in the future.


A Brief Summary of Fully Connected Network on Noncompact Symmetric Space and Ridgelet Transform based on Helgason Fourier Analysis, by Sho Sonoda, Isao Ishikawa, and Masahiro Ikeda:

Basically, what the authors wanted to accomplish is to present a fully connected network and its associated ridgelet transform on a noncompact symmetric space, using the framework of the Helgason-Fourier transform.

A symmetric space is a Riemannian manifold (a manifold that is equipped with a positive-definite inner product on the tangent space at each point) whose group of symmetries contains an inversion symmetry at every single point. A noncompact symmetric space is a symmetric space that has nonpositive sectional curvature (a way to describe the curvature of a Riemannian manifold).

A Fourier transform is a mathematical tool for converting signals/inputs between two different domains. An example of this is transforming a signal from the time domain to the frequency domain or vice versa. The Helgason-Fourier transform is the version of this idea that applies specifically to noncompact Riemannian symmetric spaces.
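
As a small illustration of moving between domains, the sketch below uses the ordinary (Euclidean) discrete Fourier transform from NumPy. It is not the Helgason-Fourier transform from the paper, only the familiar time/frequency case. The 5 Hz sine wave and the sample rate of 100 are assumed values.

```python
import numpy as np

sample_rate = 100                          # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate   # one second of time stamps
signal = np.sin(2 * np.pi * 5 * t)         # a 5 Hz sine wave

spectrum = np.fft.rfft(signal)             # time domain -> frequency domain
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(peak_freq)                           # 5.0

recovered = np.fft.irfft(spectrum, n=len(signal))  # frequency -> time
print(np.allclose(recovered, signal))      # True
```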

A ridgelet transform is a right inverse of the integral representation operator S of a neural network, meaning that applying S to the ridgelet transform of a function gives back that function. (show pictures)

In the end, the authors were able to devise a fully connected layer on a noncompact symmetric space and to present its ridgelet transform as a closed-form expression. They go on further to state that, given nice coordinates, one could turn the network into a Fourier expression and then perhaps obtain a ridgelet transform from the coordinates. This is because what they did was similar to HNNs (hyperbolic neural networks, which are NNs specialized for the kind of problems the authors are working on).

A Brief Summary of How Far Can We go Without Convolution: Improving Fully Connected Networks, by Zhouhan Lin, Roland Memisevic, and Kishore Konda:

Basically, what the authors wanted to accomplish was to improve the performance of fully connected networks. They proposed two approaches that improve performance: linear bottleneck layers and unsupervised pre-training with autoencoders.

A big advantage of linear bottleneck layers is that they counteract the issue of sparsity in neural networks. The drawback of sparsity is that the following layer is left with very little data to work with. With linear bottleneck layers in an FCN, the amount of data passed between layers can increase, decrease, or stay the same without running into that sparsity problem.
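
One way to picture a linear bottleneck is as two linear maps stacked with no activation in between, so that a wide weight matrix is replaced by two thinner ones. The sketch below assumes that reading; it is not code from the paper, and the sizes 512 and 64 are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_bottleneck, n_out = 512, 64, 512    # hypothetical sizes

A = rng.normal(size=(n_in, n_bottleneck))   # first linear map (no activation)
B = rng.normal(size=(n_bottleneck, n_out))  # second linear map

x = rng.normal(size=(1, n_in))
h = np.maximum((x @ A) @ B, 0.0)            # activation only after both maps

# Two stacked linear maps equal one map with the product matrix A @ B,
# but they need far fewer weights than a full n_in x n_out matrix.
full_params = n_in * n_out
bottleneck_params = n_in * n_bottleneck + n_bottleneck * n_out
print(full_params)        # 262144
print(bottleneck_params)  # 65536
```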

A big advantage of pre-training with autoencoders is that the weight matrices end up closer to orthogonal and are less likely to suffer from vanishing gradient problems (what this means is that the product of derivatives computed during backpropagation keeps shrinking until the partial derivatives get close to 0 or actually hit 0, at which point the affected weights effectively stop updating).
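
The link between orthogonal weights and vanishing gradients can be sketched numerically. The example below pushes a gradient vector back through 50 purely linear layers, once with an orthogonal weight matrix and once with the same matrix scaled by 0.5. It leaves out activation functions entirely, and the width, depth, and scale factor are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 64, 50                        # layer width and depth (assumed)

# An orthogonal matrix from the QR decomposition of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
small = 0.5 * Q                          # same directions, scaled down

grad = rng.normal(size=n)
g_orth, g_small = grad.copy(), grad.copy()
for _ in range(depth):                   # pass the gradient back layer by layer
    g_orth = Q.T @ g_orth
    g_small = small.T @ g_small

ratio_orth = np.linalg.norm(g_orth) / np.linalg.norm(grad)
ratio_small = np.linalg.norm(g_small) / np.linalg.norm(grad)
print(ratio_orth)    # stays close to 1
print(ratio_small)   # about 0.5 ** 50, effectively zero
```

The orthogonal matrix keeps the length of the gradient unchanged, while the scaled-down one shrinks it by half at every layer.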

Ultimately, the authors were able to improve the performance of FCNs through the methods discussed above. This is in part because linear bottleneck layers and pre-training with autoencoders are ideas that have been around, and in use to some degree, for a long time. However, the one downside is that the practicality of the improved performance is limited, as approximating any given function requires an extremely large number of hidden units.