# Overview of Deep Neural Networks

Shallow neural networks become more expressive as the number of hidden units increases. With enough hidden units, a shallow network can approximate arbitrarily complex functions in high-dimensional spaces. Deep networks, however, can produce far more linear regions than shallow networks with the same number of parameters. In practice, this lets them describe a broader family of functions for a given parameter budget.

To build intuition for how deep networks behave, we first compose two shallow networks so that the output of the first becomes the input of the second.

Consider two such networks, each with three hidden units. The first network takes an input $x$ and returns an output $y$, where $a[\cdot]$ denotes the activation function (ReLU in the code cells later in this section):

$$\begin{align*}
h_1 &= a[\theta_{10}+\theta_{11}x], \\
h_2 &= a[\theta_{20}+\theta_{21}x], \\
h_3 &= a[\theta_{30}+\theta_{31}x], \\
y &= \phi_0 + \phi_1h_1 + \phi_2h_2 + \phi_3h_3.
\end{align*}$$

The second network takes $y$ as input and returns $y'$:

$$\begin{align*}
h'_1 &= a[\theta'_{10}+\theta'_{11}y], \\
h'_2 &= a[\theta'_{20}+\theta'_{21}y], \\
h'_3 &= a[\theta'_{30}+\theta'_{31}y], \\
y' &= \phi'_0 + \phi'_1h'_1 + \phi'_2h'_2 + \phi'_3h'_3.
\end{align*}$$

One way to read this composition is that the first network "folds" the input space $x$ back onto itself, so that several different inputs produce the same output $y$. The second network then applies its function to $y$, which means that function is replicated at every point that was folded together.
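The folding picture can be checked numerically. The sketch below uses hypothetical parameters, chosen only for illustration, that make the first network a zigzag on $[0, 1]$. It confirms that three different inputs land on the same $y$ (and therefore the same $y'$), and it counts linear regions by looking for slope changes on a grid. The helper names `shallow` and `count_linear_regions` are assumptions of this sketch, not part of the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow(x, theta, phi):
    # theta holds one row [theta_i0, theta_i1] per hidden unit,
    # phi holds [phi_0, phi_1, phi_2, phi_3].
    h = relu(theta[:, 0] + theta[:, 1] * x[:, None])  # shape (N, 3)
    return phi[0] + h @ phi[1:]

def count_linear_regions(x, y, tol=1e-6):
    # A new linear region starts wherever the slope between grid points changes.
    slopes = np.diff(y) / np.diff(x)
    return 1 + int(np.sum(np.abs(np.diff(slopes)) > tol))

# Hypothetical parameters: y rises 0 -> 1, falls 1 -> 0, rises 0 -> 1 over [0, 1].
theta = np.array([[0.0, 1.0], [-1 / 3, 1.0], [-2 / 3, 1.0]])
phi = np.array([0.0, 3.0, -6.0, 6.0])

# Three inputs that the first network folds onto the same output.
x_folded = np.array([0.1, 2 / 3 - 0.1, 2 / 3 + 0.1])
y_folded = shallow(x_folded, theta, phi)
y_prime_folded = shallow(y_folded, theta, phi)  # same network reused as the second one

# Count linear regions on [0, 1] before and after composition.
x = np.linspace(0.0, 1.0, 901)
y = shallow(x, theta, phi)
y_prime = shallow(y, theta, phi)
regions_first = count_linear_regions(x, y)
regions_composed = count_linear_regions(x, y_prime)

print(y_folded, y_prime_folded)
print(regions_first, regions_composed)
```

With these parameters the composed network has nine linear regions on $[0, 1]$ from six hidden units in total, whereas a single shallow network with six hidden units and a scalar input can create at most seven.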

Applying one linear function to another yields a linear function. Substituting the expression for $y$ into the pre-activations of the second network therefore gives a linear function of $h_1$, $h_2$, and $h_3$, so the hidden units of the second network can be rewritten as:

$$\begin{align*}
h'_1 &= a[\psi_{10}+\psi_{11}h_1 + \psi_{12}h_2 + \psi_{13}h_3], \\
h'_2 &= a[\psi_{20}+\psi_{21}h_1 + \psi_{22}h_2 + \psi_{23}h_3], \\
h'_3 &= a[\psi_{30}+\psi_{31}h_1 + \psi_{32}h_2 + \psi_{33}h_3].
\end{align*}$$
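Carrying out the substitution explicitly gives $\psi_{i0} = \theta'_{i0} + \theta'_{i1}\phi_0$ and $\psi_{ij} = \theta'_{i1}\phi_j$ for $j = 1, 2, 3$. The following sketch checks this numerically with randomly drawn parameters. The variable names and the random seed are assumptions made for this illustration and do not come from the notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
theta1, phi1 = rng.normal(size=(3, 2)), rng.normal(size=4)  # first network
theta2, phi2 = rng.normal(size=(3, 2)), rng.normal(size=4)  # second network
x = np.linspace(-5.0, 5.0, 200)

# Composition: feed the output y of the first network into the second.
h = relu(theta1[:, 0] + theta1[:, 1] * x[:, None])        # shape (200, 3)
y = phi1[0] + h @ phi1[1:]
h_prime = relu(theta2[:, 0] + theta2[:, 1] * y[:, None])
y_composed = phi2[0] + h_prime @ phi2[1:]

# Equivalent two-layer form: psi_i0 = theta'_i0 + theta'_i1 * phi_0,
# psi_ij = theta'_i1 * phi_j.
psi_bias = theta2[:, 0] + theta2[:, 1] * phi1[0]          # shape (3,)
psi_weights = np.outer(theta2[:, 1], phi1[1:])            # shape (3, 3)
h_prime_deep = relu(psi_bias + h @ psi_weights.T)
y_deep = phi2[0] + h_prime_deep @ phi2[1:]

print(np.allclose(y_composed, y_deep))
print(np.linalg.matrix_rank(psi_weights))
```

The weight matrix `psi_weights` is an outer product of two vectors, so it has rank one. The composed network is therefore a restricted case of a two-layer network whose $\psi$ values are free to vary.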

Now consider a general deep network with two hidden layers of three hidden units each, in which the parameters $\psi$ can take arbitrary values. The first hidden layer is:

$$\begin{align*}
h_1 &= a[\theta_{10}+\theta_{11}x] \\
h_2 &= a[\theta_{20}+\theta_{21}x] \\
h_3 &= a[\theta_{30}+\theta_{31}x]
\end{align*}$$

The second hidden layer is:

$$\begin{align*}
h'_1 &= a[\psi_{10}+\psi_{11}h_1 + \psi_{12}h_2 + \psi_{13}h_3] \\
h'_2 &= a[\psi_{20}+\psi_{21}h_1 + \psi_{22}h_2 + \psi_{23}h_3] \\
h'_3 &= a[\psi_{30}+\psi_{31}h_1 + \psi_{32}h_2 + \psi_{33}h_3]
\end{align*}$$

The output is:

$$y' = \phi'_0 + \phi'_1h'_1 + \phi'_2h'_2 + \phi'_3h'_3$$

**Written this way, the network builds its function layer by layer: each hidden unit in the second layer applies the activation function to a linear combination of the hidden units in the first layer.**

The width of a network is the number of hidden units in each layer, and its depth is the number of hidden layers. Both are hyperparameters that determine the capacity of the network. We denote the number of hidden layers as $K$ and the number of hidden units in each layer as $D_1, D_2, \ldots, D_K$. Hyperparameters are chosen before the model parameters are learned: they fix the family of functions the model can describe, and the parameters then select one particular function from that family.
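To make the link between these hyperparameters and model size concrete, the sketch below counts the weights and biases of a fully connected network from its layer widths. The helper name `count_parameters` is hypothetical and used only for this illustration.

```python
def count_parameters(input_size, hidden_sizes, output_size):
    """Count weights and biases of a fully connected network.

    Each layer with fan_in inputs and fan_out units contributes
    fan_in * fan_out weights plus fan_out biases.
    """
    sizes = [input_size, *hidden_sizes, output_size]
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# Two hidden layers of three units, scalar input and output.
deep_count = count_parameters(1, [3, 3], 1)
# One hidden layer of six units, scalar input and output.
shallow_count = count_parameters(1, [6], 1)
print(deep_count, shallow_count)  # 22 19
```

For the two-layer network with three hidden units per layer, the count splits into 6 parameters for the first layer, 12 for the second, and 4 for the output.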

We denote the vector of hidden units at layer $k$ as $h_k$, the vector of biases that contribute to layer $k+1$ as $\beta_k$, and the matrix of weights that are applied to layer $k$ and contribute to layer $k+1$ as $\Omega_k$. A general deep network $y = f[x, \varphi]$ with $K$ hidden layers and parameters $\varphi = \{\beta_k, \Omega_k\}_{k=0}^{K}$ can then be written recursively as:

$$\begin{align*}
h_1 &= a[\beta_0 + \Omega_0x], \\
h_2 &= a[\beta_1 + \Omega_1h_1], \\
&\vdots \\
h_K &= a[\beta_{K-1} + \Omega_{K-1}h_{K-1}], \\
y &= \beta_K + \Omega_Kh_K.
\end{align*}$$

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("dark_background")

In [2]:
def ReLU(value:np.ndarray)->np.ndarray:
  return np.maximum(0,value)

In [3]:
def ShallowNetwork(sample:np.ndarray,theta:np.ndarray,phi:np.ndarray)->np.ndarray:
  theta_10,theta_11,theta_20,theta_21,theta_30,theta_31 = theta
  # Hidden units: ReLU applied to a linear function of the input
  h1 = ReLU(theta_10+theta_11*sample)
  h2 = ReLU(theta_20+theta_21*sample)
  h3 = ReLU(theta_30+theta_31*sample)
  y = phi[0]+phi[1]*h1+phi[2]*h2+phi[3]*h3 # Output: linear combination of the hidden units
  return y

In [5]:
def ComposeNetworks(sample:np.ndarray,thetaFirst:np.ndarray,phiFirst:np.ndarray,thetaSecond:np.ndarray,phiSecond:np.ndarray)->np.ndarray:
  yFirst = ShallowNetwork(sample,thetaFirst,phiFirst) # Output of the first network
  ySecond = ShallowNetwork(yFirst,thetaSecond,phiSecond) # Use the output of the first as input to the second
  return ySecond

In [6]:
def DNN(sample:np.ndarray,parameters:list)->np.ndarray:
  thetaFirst,psi,phiSecond = parameters
  # First layer computation
  h1 = ReLU(thetaFirst[0]+thetaFirst[1]*sample)
  h2 = ReLU(thetaFirst[2]+thetaFirst[3]*sample)
  h3 = ReLU(thetaFirst[4]+thetaFirst[5]*sample)
  # Second layer computation using first layer's output
  h1Prime = ReLU(psi[0]+psi[1]*h1+psi[2]*h2+psi[3]*h3)
  h2Prime = ReLU(psi[4]+psi[5]*h1+psi[6]*h2+psi[7]*h3)
  h3Prime = ReLU(psi[8]+psi[9]*h1+psi[10]*h2+psi[11]*h3)
  # Output computation
  yPrime = phiSecond[0]+phiSecond[1]*h1Prime+phiSecond[2]*h2Prime+phiSecond[3]*h3Prime
  return yPrime

In [28]:
samples = np.linspace(-5,5,100) # sample
thetaValues = np.linspace(-1,3,6)
thetaValues_2 = np.linspace(-2,1,6)
phiValues = np.linspace(-4,5,4)
phiValues_2 = np.linspace(-6,2,4)
psiValues = np.linspace(-2,5,12)

In [30]:
firstOutput = ShallowNetwork(samples,thetaValues,phiValues)
secondOutput = ComposeNetworks(samples,thetaValues,phiValues,thetaValues_2,phiValues_2)
print(f"First output shape from Shallow: {firstOutput.shape}")
print(f"Second output shape from compose network: {secondOutput.shape}")

First output shape from Shallow: (100,)
Second output shape from compose network: (100,)


In [31]:
parameters = [thetaValues,psiValues,phiValues_2]

In [32]:
dnnResult = DNN(samples,parameters)
print(f"DNN result shape: {dnnResult.shape}")

DNN result shape: (100,)


- with forward pass:

In [7]:
inputSize = 1 # Input size (x)
hiddenSize = [3,3] # Number of units in hidden layers (D_1, D_2)
outputSize = 1 # Output size (y')
numberLayers = 2 # Number of layers (K)

In [8]:
np.random.seed(45) # For reproducibility

In [9]:
# Initialize weights (Ω_k) and biases (β_k) for each layer
weights = [np.random.randn(hiddenSize[i-1] if i > 0 else inputSize,size) for i,size in enumerate(hiddenSize+[outputSize])]
biases = [np.random.randn(size) for size in hiddenSize+[outputSize]]

In [14]:
print(f"Length of weights: {len(weights)}")
print(f"Length of biases: {len(biases)}")

Length of weights: 3
Length of biases: 3


In [15]:
print(f"Weight shape: {weights[0].shape}")
print(f"Bias shape: {biases[0].shape}")

Weight shape: (1, 3)
Bias shape: (3,)


In [19]:
print(f"Weight example: {weights[0]}\n\nBias example: {biases[0]}")

Weight example: [[ 0.02637477  0.2603217  -0.39514554]]

Bias example: [-0.49505193 -1.00860081  0.02524419]


In [16]:
def ForwardPass(sample:np.ndarray)->np.ndarray:
  h = [None]*numberLayers # Hidden layer activations
  h[0] = ReLU(biases[0]+np.dot(weights[0].T,sample))
  for k in range(1,numberLayers):
    h[k] = ReLU(biases[k]+np.dot(weights[k].T,h[k-1]))
  # Output layer
  yPrime = biases[-1]+np.dot(weights[-1].T,h[-1])
  return yPrime

In [17]:
sample = np.array([0.5])
yPrime = ForwardPass(sample)
print(f"Output: {yPrime} for input --> ::[{sample}]::")

Output: [-1.39381267] for input --> ::[[0.5]]::


- house price example (the weights are random and untrained, so the printed price is arbitrary):

In [47]:
inputSize = 1 # House size
hiddenSize = [3,3] # Two hidden layers with three units each
outputSize = 1 # Predicted price

In [48]:
weights = [
    np.random.randn(inputSize,hiddenSize[0]),
    np.random.randn(hiddenSize[0],hiddenSize[1]),
    np.random.randn(hiddenSize[1],outputSize)
]
biases = [
    np.random.rand(size) for size in hiddenSize+[outputSize]
]

In [49]:
def ForwardPass(sample:np.ndarray)->np.ndarray:
  h = ReLU(np.dot(sample,weights[0])+biases[0])
  for idx in range(1,len(weights)-1):
    h = ReLU(np.dot(h,weights[idx])+biases[idx])
  yPrediction = np.dot(h,weights[-1])+biases[-1]
  return yPrediction

In [50]:
sampleHouseSize = np.array([5000]) # Example: 5000 sqft
predictedPrice = ForwardPass(sampleHouseSize)
print(f"Predicted Price for a house of size {sampleHouseSize[0]} sqft --> ${predictedPrice[0]:.2f}")

Predicted Price for a house of size 5000 sqft --> $0.73


- learning rate & loss:

In [76]:
np.random.seed(45) # Ensure reproducible results

In [83]:
xData = np.linspace(-10,10,100)[:,np.newaxis] # 100 data points in the range [-10, 10]
trueParameters = {"weight":2,"bias":3}
groundTruth = trueParameters["weight"]*xData+trueParameters["bias"]+np.random.normal(0,2,xData.shape) # Add Gaussian noise

In [84]:
# Initial parameters
weight = np.random.randn(1,1)
print(f"Initial weight: {weight[0][0]}")
bias = np.random.randn(1)
print(f"Initial bias: {bias[0]}")

Initial weight: 1.7184018866795001
Initial bias: 1.2945594773525302


In [85]:
learningRate = 0.01

In [86]:
def LeastSquaresLoss(groundTruth:np.ndarray,predictions:np.ndarray)->float:
  return np.mean((groundTruth-predictions)**2)

In [87]:
def PredictionFunction(value:np.ndarray)->np.ndarray:
  return value*weight+bias

In [88]:
epochs = 1000
for e in range(epochs):
  predictions = PredictionFunction(xData)
  loss = LeastSquaresLoss(groundTruth,predictions)
  if e % 100 == 0:
    print(f"Epoch {e} --> LOSS :[{loss}] with parameters | Weight:[{weight}]: & Bias:[{bias}]:")
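  # Gradients of the mean squared error with respect to the weight and the bias
  gradWeight = -2 * np.mean(xData*(groundTruth-predictions))
  gradBias = -2 * np.mean(groundTruth-predictions)
  # Gradient descent step: move each parameter against its gradient
  weight -= learningRate*gradWeight
  bias -= learningRate*gradBias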

Epoch 0 --> LOSS :[10.457490859431765] with parameters | Weight:[[[1.71840189]]]: & Bias:[[1.29455948]]:
Epoch 100 --> LOSS :[4.083007606594319] with parameters | Weight:[[[2.01246432]]]: & Bias:[[2.91619071]]:
Epoch 200 --> LOSS :[4.022613610302524] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.13125072]]:
Epoch 300 --> LOSS :[4.021551403920438] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.15977189]]:
Epoch 400 --> LOSS :[4.021532721891305] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.16355435]]:
Epoch 500 --> LOSS :[4.021532393312774] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.16405598]]:
Epoch 600 --> LOSS :[4.021532387533752] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.1641225]]:
Epoch 700 --> LOSS :[4.021532387432112] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.16413133]]:
Epoch 800 --> LOSS :[4.021532387430324] with parameters | Weight:[[[2.01246432]]]: & Bias:[[3.1641325]]:
Epoch 900 --> LOSS :[4.021532387430293] with para

In [89]:
print(f"Learned parameters - Weight: {weight.flatten()}, Bias: {bias}")

Learned parameters - Weight: [2.01246432], Bias: [3.16413267]
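
As a sanity check on the training loop, gradient descent on a least-squares loss should approach the closed-form least-squares solution. The sketch below is self-contained and regenerates similar data with its own seed, so its numbers will not match the outputs printed above. The names `w_exact` and `b_exact` and the zero initialization are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, not the notebook's
x = np.linspace(-10.0, 10.0, 100)
y = 2.0 * x + 3.0 + rng.normal(0.0, 2.0, size=x.shape)  # noisy line

# Closed-form least-squares fit of y = w * x + b.
design = np.column_stack([x, np.ones_like(x)])
(w_exact, b_exact), *_ = np.linalg.lstsq(design, y, rcond=None)

# Gradient descent on the mean squared error, same update rule as above.
w, b = 0.0, 0.0
lr = 0.01
for _ in range(1000):
    residual = y - (w * x + b)
    w -= lr * (-2.0 * np.mean(x * residual))
    b -= lr * (-2.0 * np.mean(residual))

print(f"closed form:      w={w_exact:.6f}, b={b_exact:.6f}")
print(f"gradient descent: w={w:.6f}, b={b:.6f}")
```

Because the inputs are symmetric around zero, the weight error shrinks by a factor of $1 - 2\,\eta\,\overline{x^2}$ per step, where $\eta$ is the learning rate and $\overline{x^2} \approx 34$ here. This is why the weight settles within the first hundred epochs while the bias takes longer, and it suggests that a learning rate above roughly $0.029$ would make the weight update diverge for this data.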
