# Welcome to the Neural Network Playground

Welcome to this Jupyter notebook, the interactive companion to the "Neural Network Playground" repository. Here, we embark on a fascinating journey through the diverse world of neural networks, exploring various architectures and their unique capabilities. This notebook is crafted to be both an educational resource and a practical guide, providing you with the opportunity to dive deep into the functionalities, designs, and applications of neural networks across different tasks and data types.

Through descriptive explanations, code implementations, and hands-on examples, we aim to foster a deeper understanding of neural networks and inspire you to experiment, innovate, and contribute to the field of artificial intelligence and machine learning. Let's begin our exploration and unlock the potential of neural networks together.

# Table of Contents

## [Setup](#Setup)
- [Imports](#Imports)

## [Foundational Concepts](#Foundational-Concepts)
- [Base Neural Network Class (BaseNN)](#Base-Neural-Network-Class-(BaseNN))

#### Foundational Models:
- **Basic Neural Networks**: 
  - Perceptron
  - Feed Forward
  - Radial Basis Function Network
- **Deep Learning Essentials**: 
  - Deep Feed Forward (DFF)

#### Deep Learning Architectures:
- **Core Architectures**: 
  - **Autoencoders**
      - **AE**
      - **VAE**
      - **DAE**
      - **SAE**
  - Deep Convolutional Network (DCN) 
  
- **Recurrent and Memory Models**: 
  - RNN
  - LSTM
  - GRU
  - NTM

#### Advanced Concepts:
- **Probabilistic and Generative Models**: 
  - Markov Chain
  - Hopfield Network
  - Boltzmann Machine
  - Restricted Boltzmann Machine
  - Deep Belief Network
  - GAN
- **Hybrid and Specialized Models**: 
  - SNN
  - LSM
  - ELM
  - ESN
  - ResNet
  - Kohonen Network (SOFM)
  - SVM

# Setup

# Imports

In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Foundational Concepts
## Base Neural Network Class (BaseNN)

Creating a BaseNN class intended to use inheritance in later implementations of different NN's when abstracting the base class to make specialized classes.

In the development of this neural network project, we introduce the `BaseNN` class as a foundational component, leveraging the principles of object-oriented programming to foster a modular and scalable approach to neural network design. 

The `BaseNN` class serves as a blueprint for all subsequent neural network models, encapsulating common attributes such as `input_size`, `hidden_size`, and `output_size`. 

These attributes are essential across a wide range of neural network architectures, ensuring a consistent structure across our implementations. 

Furthermore, the class defines an abstract `forward` method, which obliges any derived class to specify its own data processing mechanism, detailing how inputs are transformed into outputs through the network. This approach not only enforces a uniform interface but also promotes code reusability and simplifies the process of experimenting with and extending different neural network models. 

By abstracting common functionalities into the `BaseNN` class, we significantly reduce redundancy and streamline the development of specialized neural network architectures, allowing for a clear and efficient exploration of the vast landscape of neural network designs.

In [2]:
# Base class for neural networks
class BaseNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(BaseNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

    def forward(self, x):
        raise NotImplementedError("forward method must be implemented in derived classes")

# Foundational Models

## Basic Neural Networks

### Perceptron (P)

**High-Level Overview**

The Perceptron represents the simplest form of a feedforward neural network, consisting of a single neuron with adjustable weights and a bias. Developed in 1957 by Frank Rosenblatt, it laid the groundwork for understanding neural networks. The Perceptron algorithm is a binary classifier that linearly separates data into two parts, making it a cornerstone in the study of machine learning for simple predictive modeling tasks.

**Data Type**

Perceptrons can process:
- Numerical data
- Binary features

Given its simplicity, it's primarily suited for linearly separable datasets where inputs can be categorized into two distinct groups.

**Task Objective**

Perceptrons are utilized for:
- Binary classification tasks
- Basic pattern recognition

Their straightforward approach allows them to make decisions by weighing input features, showcasing early neural network capabilities in distinguishing between two classes.

**Scalability**

Due to its simplicity, the scalability of a single-layer Perceptron is limited to problems that are linearly separable. For more complex datasets or non-linear problems, multi-layer networks or different algorithms are recommended.

**Robustness to Noise**

Perceptrons can be sensitive to noise in the data, especially since they do not incorporate error minimization in the same way as more advanced models. They perform best with clean, well-defined datasets.

**Implementation Variants**

While the basic Perceptron is foundational, several key developments have been made to extend its utility, including:
- **Multi-layer Perceptrons (MLPs):** Comprising multiple layers of neurons to tackle non-linearly separable data.
- **Stochastic Gradient Descent:** An optimization method allowing Perceptrons and their multi-layer successors to learn from training data iteratively.

**Practical Application Guidance**

**When to Use Perceptrons:**
- For simple linear classification problems.
- As a learning tool to understand the basics of neural network architecture and linear decision boundaries.

**Considerations:**
- The Perceptron's inability to solve non-linear problems limits its application in complex real-world scenarios.
- It serves as a building block for more sophisticated networks that can handle a broader range of tasks.

**Conclusion**

The Perceptron model, with its simplicity, offers a fundamental understanding of neural network principles. Although its direct applications are limited to linearly separable tasks, the Perceptron remains an essential concept in machine learning, providing a stepping stone to more advanced neural network architectures and algorithms.

In [3]:
class Perceptron:
    def __init__(self, input_size):
        # Initialize weights and bias randomly
        self.weights = np.random.rand(input_size)
        self.bias = np.random.rand()

    def activate(self, x):
        # Simple step function as activation
        return 1 if x > 0 else 0

    def forward(self, inputs):
        # Calculate the weighted sum of inputs
        weighted_sum = np.dot(inputs, self.weights) + self.bias

        # Apply the activation function
        output = self.activate(weighted_sum)

        return output

In [4]:
# Example Usage
if __name__ == "__main__":
    # Create a perceptron with 2 input cells
    perceptron = Perceptron(input_size=2)

    # Example input
    input_data = np.array([0.5, 0.8])

    # Get the output from the perceptron
    output = perceptron.forward(input_data)

    print(f"Input: {input_data}")
    print(f"Output: {output}")

Input: [0.5 0.8]
Output: 1


### Feed Forward (FF)

**High-Level Overview**

Feed Forward Neural Networks (FFNNs) are the simplest type of artificial neural network architecture, where connections between the nodes do not form a cycle. This model is structured in layers, consisting of an input layer, one or more hidden layers, and an output layer. The data moves in only one direction - forward - from the input nodes, through the hidden nodes (if any), and finally to the output nodes. There are no cycles or loops in the network, hence the name "feedforward."

**Data Type**

Feed Forward Neural Networks are capable of handling a variety of data types, making them versatile for numerous applications:

- Numerical data
- Categorical data
- Images (when flattened to a vector)
- Text (via bag-of-words or TF-IDF vectors)

Their adaptability makes FFNNs suitable for a broad range of tasks across different fields.

**Task Objective**

FFNNs are widely used for:

- Classification tasks, both binary and multi-class.
- Regression tasks for predicting continuous outcomes.
- Pattern recognition, serving as the foundational architecture for more complex tasks.

They serve as the backbone for understanding more complex neural network architectures.

**Scalability**

The scalability of Feed Forward Neural Networks depends on the size of the input data and the complexity of the task. While adding more hidden layers can increase the network's capacity to learn complex patterns, it also raises the computational cost and the risk of overfitting. Techniques like dropout and regularization are often employed to manage these challenges.

**Robustness to Noise**

FFNNs exhibit a degree of robustness to noise in the input data, thanks to their capacity to learn generalized representations. However, their performance can be significantly affected by the presence of irrelevant features or highly noisy datasets, necessitating careful data preprocessing and feature selection.

**Implementation Variants**

Feed Forward Neural Networks can be implemented with various activation functions (ReLU, Sigmoid, Tanh) and architectures (deep networks with many layers, wide networks with more neurons per layer) to suit specific problems:

- **Deep Feed Forward Networks**: Incorporate multiple hidden layers to capture complex patterns.
- **Wide Networks**: Increase the number of neurons in hidden layers to enhance model capacity without deepening the architecture.

**Practical Application Guidance**

**When to Use Feed Forward Neural Networks:**

- For straightforward prediction problems where the complexity of recurrent or convolutional networks is unnecessary.
- In cases where the data can be represented in a fixed-size vector and does not possess inherent sequential or spatial patterns.

**Considerations:**

- While FFNNs are powerful for many tasks, they may not be ideal for data with temporal sequences (e.g., time-series) or spatial hierarchies (e.g., images), where recurrent or convolutional architectures might be more appropriate.
- Careful design and regularization are essential to prevent overfitting, especially as the network size increases.

**Conclusion**

Feed Forward Neural Networks form the cornerstone of neural network models, offering a straightforward yet powerful framework for numerous predictive modeling tasks. Their simplicity, coupled with the potential for customization and scalability, makes them an indispensable tool. Understanding and mastering FFNNs provide a solid foundation for delving into more specialized neural network architectures.

In [5]:
class FeedforwardNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(FeedforwardNN, self).__init__()
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.hidden_layer = nn.Linear(hidden_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, 1)  # Single output neuron

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        x = torch.relu(self.hidden_layer(x))
        x = torch.sigmoid(self.output_layer(x))
        return x

# Instantiate the neural network
input_size = 2  # Number of input features 
hidden_size = 3  # Number of neurons in the hidden layers
model = FeedforwardNN(input_size, hidden_size)

In [6]:
# Define a sample input
sample_input = torch.tensor([[0.5, 0.3]])  # Example Data

# Forward pass to get the output
output = model(sample_input)

# Print the model architecture and output
print(model)
print("Input:", sample_input)
print("Output:", output.item())

FeedforwardNN(
  (input_layer): Linear(in_features=2, out_features=3, bias=True)
  (hidden_layer): Linear(in_features=3, out_features=3, bias=True)
  (output_layer): Linear(in_features=3, out_features=1, bias=True)
)
Input: tensor([[0.5000, 0.3000]])
Output: 0.47536784410476685


### Radial Basis Network (RBF)

**High-Level Overview**

Radial Basis Function Networks (RBFNs) are a type of artificial neural network that uses radial basis functions as activation functions. They are typically used for interpolation in multidimensional space, pattern recognition, function approximation, and time-series prediction. The core idea behind RBFNs is to transform the input space into a new space where the problem becomes linearly separable. This transformation is achieved using a set of radial basis functions, each associated with a center and affecting only the region close to that center.

**Data Type**

RBF Networks are particularly effective with:

- Numerical data
- Multidimensional data for function approximation
- Patterns that require a localized response

Their ability to handle non-linear problems makes them suitable for various tasks in regression, classification, and clustering.

**Task Objective**

RBF Networks excel in:

- Function approximation and regression tasks
- Classification problems
- Time-series prediction
- Clustering and unsupervised learning

The localized nature of radial basis functions allows RBFNs to model complex and non-linear relationships within the data efficiently.

**Scalability**

The scalability of RBF Networks can be challenging due to the need to select an appropriate number of centers and their locations. While having more centers can improve the network's ability to approximate complex functions, it also increases the computational cost and the risk of overfitting.

**Robustness to Noise**

RBF Networks demonstrate robustness to noise in the input data due to the smoothness of the radial basis functions. However, the choice of the width parameter of the basis functions is crucial, as it influences the network's sensitivity to the input data's scale and noise level.

**Implementation Variants**

Several variations of RBF Networks exist, primarily differing in how the centers and the width of the basis functions are determined:

- **Fixed Centers Selected Randomly**: Centers are chosen randomly from the input data.
- **Clustering-based Centers**: Centers are determined using clustering algorithms like k-means to capture the data's underlying structure.
- **Orthogonal Least Squares (OLS)**: A more sophisticated method for selecting centers that aims to minimize redundancy among the basis functions.

**Practical Application Guidance**

**When to Use Radial Basis Function Networks:**

- In situations where the data exhibits non-linear relationships that need to be captured with high precision.
- For problems where local interactions dominate the system's behavior, and a global approximation model might not be effective.

**Considerations:**

- The selection of the number of centers and their locations is critical for the performance of the RBF Network. Incorrect choices can lead to poor generalization or overfitting.
- The determination of the width parameter requires careful tuning, often based on cross-validation, to balance the trade-off between bias and variance.

**Conclusion**

Radial Basis Function Networks offer a powerful and flexible framework for addressing non-linear problems across various domains. By leveraging localized responses to input stimuli, RBFNs can model complex relationships within data, making them a valuable tool for tasks requiring high precision in function approximation, classification, and beyond. Understanding and effectively implementing RBF Networks can provide practitioners with a robust method for tackling challenging problems that traditional linear models cannot solve.


In [7]:
class RadialBasisFunction:
    def __init__(self, input_size, num_centers):
        # Initialize centers and width parameters randomly
        self.centers = np.random.rand(num_centers, input_size)
        self.width = np.random.rand()
        self.weights = np.random.rand(num_centers)
    
    def gaussian(self, x, center, width):
        # Gaussian activation function
        return np.exp(-np.sum((x - center)**2) / (2 * width**2))
    
    def forward(self, inputs):
        # Calculate the activation for each center
        activations = np.array([self.gaussian(inputs, center, self.width) for center in self.centers])
        
        # Calculate the weighted sum of activations
        weighted_sum = np.dot(activations, self.weights)
        
        # Apply a threshold for binary output
        output = 1 if weighted_sum > 0.5 else 0
        
        return output

In [8]:
# Example Usage
if __name__ == "__main__":
    # Create an RBF network with 2 input cells and 3 centers
    rbf_network = RadialBasisFunction(input_size=2, num_centers=3)
    
    # Example input
    input_data = np.array([0.5, 0.8])
    
    # Get the output from the RBF network
    output = rbf_network.forward(input_data)
    
    print(f"Input: {input_data}")
    print(f"Output: {output}")

Input: [0.5 0.8]
Output: 1


## Deep Learning Essentials

### Deep Feed Forward (DFF)

**High-Level Overview**

Deep Feedforward Neural Networks, often simply referred to as Deep Neural Networks (DNNs), are the quintessential deep learning models. These networks extend the concept of the basic feedforward neural network by introducing multiple hidden layers between the input and output layers. This architecture enables the learning of complex patterns and hierarchies in data, making DNNs incredibly effective for a wide range of predictive modeling tasks.

**Data Type**

Deep Feedforward Neural Networks are designed to handle:
- Numerical data
- Images
- Text
- Audio signals

Their flexibility and capacity for high-dimensional data processing make them applicable across nearly all domains of machine learning and artificial intelligence.

**Task Objective**

DNNs are particularly proficient in:
- Classification
- Regression
- Pattern recognition
- Feature extraction

The depth of these networks allows for the modeling of complex relationships in the data, contributing to advancements in areas like computer vision, natural language processing, and more.

**Scalability**

The scalability of Deep Feedforward Neural Networks is a hallmark of their design. With the ability to adjust the number of layers and nodes within those layers, DNNs can be tailored to the complexity of the task at hand. However, this scalability comes with increased computational demands, necessitating efficient training techniques and hardware acceleration in many cases.

**Robustness to Noise**

DNNs exhibit a notable degree of robustness to noise and variability in the input data, thanks to their layered structure and the non-linear transformations applied at each layer. This makes them well-suited for real-world applications where data imperfections are common.

**Implementation Variants**

Deep Feedforward Neural Networks can be customized with various activation functions, optimization algorithms, and regularization techniques to improve their performance and generalization ability. Common variants include:
- **ReLU-activated DNNs:** Use the Rectified Linear Unit function to introduce non-linearity without the vanishing gradient problem.
- **Dropout-regularized DNNs:** Implement dropout layers to reduce overfitting by randomly omitting subsets of features during training.

**Practical Application Guidance**

**When to Use Deep Feedforward Neural Networks:**
- In tasks requiring the modeling of complex relationships or patterns in the data.
- When the dataset is large and high-dimensional, providing enough information to train deep models effectively.

**Considerations:**
- The depth and complexity of DNNs necessitate careful design and training to avoid issues like overfitting and ensure sufficient generalization to new data.
- Training deep models can be computationally intensive and time-consuming, requiring appropriate hardware and optimization strategies.

**Conclusion**

Deep Feedforward Neural Networks are a foundational pillar of modern deep learning, offering unparalleled flexibility and learning capacity. Their ability to learn from and make predictions on complex data has revolutionized many fields of study and industry. By leveraging advanced training techniques and computational resources, practitioners can unlock the full potential of DNNs to solve a vast array of challenging problems.

In [9]:
# Deep Feed Forward Neural Network
class DeepFeedforwardNN(BaseNN):
    def __init__(self, input_size, hidden_size, output_size):
        super(DeepFeedforwardNN, self).__init__(input_size, hidden_size, output_size)
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.hidden_layers = nn.ModuleList([
            nn.Linear(hidden_size, hidden_size) for _ in range(2)  # Two hidden layers with 4 nodes each
        ])
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        for layer in self.hidden_layers:
            x = torch.relu(layer(x))
        x = torch.sigmoid(self.output_layer(x))
        return x

In [10]:
# Instantiate the deep feedforward neural network
input_size = 3  # Number of input features 
hidden_size = 4  # Number of nodes in each hidden layer
output_size = 2  # Number of output nodes
deep_feedforward_model = DeepFeedforwardNN(input_size, hidden_size, output_size)

# Define a sample input
sample_input = torch.tensor([[0.5, 0.3, 0.8]])  # Example Data

# Forward pass to get the output
output_deep_feedforward = deep_feedforward_model(sample_input)

# Print the model architecture and output
print(deep_feedforward_model)
print("Output:", output_deep_feedforward)

DeepFeedforwardNN(
  (input_layer): Linear(in_features=3, out_features=4, bias=True)
  (hidden_layers): ModuleList(
    (0-1): 2 x Linear(in_features=4, out_features=4, bias=True)
  )
  (output_layer): Linear(in_features=4, out_features=2, bias=True)
)
Output: tensor([[0.4298, 0.3762]], grad_fn=<SigmoidBackward0>)


# Deep Learning Architectures

## Core Architectures

### Autoencoder (AE)

**High-Level Overview**

Autoencoders (AEs) are a type of neural network used for unsupervised learning of efficient data codings. The primary goal of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise.” This is achieved by designing the autoencoder to encode inputs into a low-dimensional space and then decode these encodings back into the original input data.

**Data Type**

Autoencoders can handle a variety of data types, including:
- Numerical data
- Images
- Audio signals
- Text data

Their versatility makes them suitable for applications ranging from compression to noise reduction or feature extraction.

**Task Objective**

The main applications of AEs include:
- Dimensionality reduction
- Feature learning
- Data compression
- Denoising

AEs are particularly useful in scenarios where the intrinsic structure of the data needs to be learned without labeled data.

**Scalability**

Autoencoders scale well with the complexity of the data and the desired level of compression or feature extraction. The network architecture can be adjusted according to the specific requirements of the task, allowing for flexible implementations that cater to large and high-dimensional datasets.

**Robustness to Noise**

Autoencoders, especially denoising autoencoders (DAEs), are designed to be robust to noise in the input data. By learning to reconstruct inputs from corrupted versions, they can effectively identify and ignore irrelevant or noisy features in the data.

**Implementation Variants**

Several variants of autoencoders have been developed to address different challenges, including:
- **Variational Autoencoders (VAEs):** Focus on generating new data that is similar to the training data.
- **Denoising Autoencoders (DAEs):** Aim to remove noise from corrupted input data.
- **Sparse Autoencoders (SAEs):** Introduce sparsity in the encoded representations to improve feature selection.

Note: These three variants each have their own sections directly following this.

**Practical Application Guidance**

**When to Use Autoencoders:**
- For tasks requiring data compression without significant loss of information.
- When looking to learn efficient representations of data without supervision.
- In applications where the removal of noise or the extraction of relevant features from the data is essential.

**Considerations:**
- The choice of autoencoder variant and network architecture should be aligned with the specific objectives of the task.
- Careful tuning of the network parameters is crucial to achieve the desired balance between compression, reconstruction accuracy, and feature learning.

**Conclusion**

Autoencoders offer a powerful framework for learning efficient representations of data in an unsupervised manner. By leveraging their ability to compress and denoise data, as well as to learn salient features, autoencoders are instrumental in various applications across machine learning and signal processing domains.


In [11]:
class Autoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Linear(input_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        encoded = torch.relu(self.encoder(x))
        decoded = torch.sigmoid(self.decoder(encoded))
        return decoded

In [12]:
# Instantiate the autoencoder
input_size = 10  # Number of input features
hidden_size = 5  # Number of hidden nodes (compressed representation)
autoencoder = Autoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))

# Forward pass to get the reconstructed output
output_autoencoder = autoencoder(sample_input)

# Print the model architecture, input, and output
print(autoencoder)
print("Input:", sample_input)
print("Output:", output_autoencoder)

Autoencoder(
  (encoder): Linear(in_features=10, out_features=5, bias=True)
  (decoder): Linear(in_features=5, out_features=10, bias=True)
)
Input: tensor([[0.7870, 0.7469, 0.8532, 0.2466, 0.9331, 0.9623, 0.0533, 0.2371, 0.3791,
         0.9191]])
Output: tensor([[0.5045, 0.4231, 0.4493, 0.4943, 0.4588, 0.6211, 0.4717, 0.3584, 0.3688,
         0.6155]], grad_fn=<SigmoidBackward0>)


#### VAE

**High-Level Overview**

Variational Autoencoders (VAEs) are a cornerstone in the field of generative AI, representing a powerful class of deep learning models for generative modeling. They are designed to learn the underlying probability distribution of training data, enabling the generation of new data points with similar properties. VAEs combine traditional autoencoder architecture with variational inference principles, allowing them to compress data into a latent space and then generate data by sampling from this space, thereby facilitating a deep exploration of the continuous latent space representing the data.

**Data Type**

VAEs demonstrate remarkable adaptability across a range of data types, including:
- Images
- Text
- Audio
- Continuous numerical data

This versatility underscores their prominence in generative AI, making them a popular choice for a wide array of generative tasks.

**Task Objective**

Emphasizing their role in generative AI, VAEs excel in:
- Data generation
- Feature extraction and representation learning
- Dimensionality reduction
- Anomaly detection

Their deep learning capabilities enable them not only to model complex distributions but also to generate new, coherent samples, showcasing the transformative potential of generative AI.

**Scalability**

With their deep neural network architecture, VAEs scale effectively to accommodate the complexity and volume of vast datasets, further solidifying their status in generative AI for handling high-dimensional data efficiently.

**Robustness to Noise**

VAEs' proficiency in denoising and reconstructing inputs highlights their robustness, making them invaluable for applications in generative AI where data cleanliness cannot be assured.

**Implementation Variants**

Reflecting the innovation in generative AI, various VAE models have been developed to address specific challenges or improve upon the original framework, including Conditional VAEs, Beta-VAEs, and Disentangled VAEs, each offering unique advantages for controlled data generation and enhanced interpretability of latent representations.

**Practical Application Guidance**

In the realm of generative AI, VAEs are particularly suited for:
- Generating new data that mimics the properties of specific datasets.
- Unsupervised learning of complex data distributions.
- Applications requiring a nuanced understanding of data's underlying structure.

**Considerations:**

Training VAEs can present challenges, such as mode collapse, underscoring the need for expertise in generative AI to navigate these complexities successfully.

**Conclusion**

Variational Autoencoders (VAEs) have cemented their place as a fundamental technology in generative AI, offering a sophisticated mechanism for understanding and generating data. Their broad applicability and the depth of insight they provide into data's inherent structure make them a pivotal tool in the advancement of generative modeling.

In [13]:
class VariationalAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(VariationalAutoencoder, self).__init__()

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)
        self.encoder_fc2_mean = nn.Linear(hidden_size, hidden_size)
        self.encoder_fc2_logvar = nn.Linear(hidden_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)
        self.decoder_fc2 = nn.Linear(input_size, input_size)

    def reparameterize(self, mean, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mean + eps * std

    def forward(self, x):
        # Encoder
        x = torch.relu(self.encoder_fc1(x))
        mean = self.encoder_fc2_mean(x)
        logvar = self.encoder_fc2_logvar(x)

        # Reparameterization trick
        z = self.reparameterize(mean, logvar)

        # Decoder
        x_hat = torch.relu(self.decoder_fc1(z))
        x_hat = torch.sigmoid(self.decoder_fc2(x_hat))

        return x_hat, mean, logvar

In [14]:
# Instantiate the variational autoencoder
input_size = 4  # Number of input features
hidden_size = 4  # Number of hidden nodes in probabilistic layer
vae = VariationalAutoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))

# Forward pass to get the reconstructed output and latent variables
output_vae, mean, logvar = vae(sample_input)

# Print the model architecture and output
print(vae)

print("\nInput:", sample_input)
print("Output:", output_vae)
print("Mean:", mean)
print("Log Variance:", logvar)

VariationalAutoencoder(
  (encoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (encoder_fc2_mean): Linear(in_features=4, out_features=4, bias=True)
  (encoder_fc2_logvar): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc2): Linear(in_features=4, out_features=4, bias=True)
)

Input: tensor([[0.0343, 0.9712, 0.4583, 0.0153]])
Output: tensor([[0.4484, 0.5287, 0.4188, 0.4760]], grad_fn=<SigmoidBackward0>)
Mean: tensor([[-0.1030, -0.2888, -0.1711,  0.0543]], grad_fn=<AddmmBackward0>)
Log Variance: tensor([[-0.0906, -0.0857,  0.0126,  0.3173]], grad_fn=<AddmmBackward0>)


#### DAE

**High-Level Overview**

Denoising Autoencoders (DAEs) are an advanced type of autoencoder designed to *remove noise from input data*. By intentionally corrupting the input data and then learning to reconstruct the original, uncorrupted data, DAEs are trained to capture the most relevant features. This process enhances the model's ability to generalize from the data, making it highly effective for tasks that require robust feature extraction and data denoising capabilities.

**Data Type**

Denoising Autoencoders are capable of processing various data types, including:
- Images
- Text
- Audio signals
- Continuous numerical data

Their adaptability makes them particularly useful for applications involving noisy or incomplete data.

**Task Objective**

Denoising Autoencoders are primarily used for:
- Data denoising
- Feature extraction and representation learning
- Dimensionality reduction
- Data generation and enhancement

By learning to ignore the "noise" in data, DAEs excel in recovering clean representations from corrupted inputs.

**Scalability**

Similar to other autoencoders, the scalability of DAEs depends on the network architecture. Modern techniques and computational resources allow DAEs to handle large datasets and complex noise patterns effectively, showcasing their scalability in practical applications.

**Robustness to Noise**

The core strength of DAEs lies in their robustness to noise. They are specifically trained to identify and ignore irrelevant features (noise), focusing on reconstructing the essential aspects of the data, which makes them exceptionally reliable for denoising tasks.

**Implementation Variants**

Several variants of DAEs have been developed to address different types of noise or to enhance specific aspects of denoising, including:
- **Gaussian Noise DAEs:** Target Gaussian noise in the data.
- **Salt-and-Pepper Noise DAEs:** Designed to remove binary noise from images.
- **Variational DAEs:** Combine denoising capabilities with variational autoencoder frameworks for improved generative properties.

**Practical Application Guidance**

**When to Use Denoising Autoencoders:**
- For cleaning noisy data before further processing or analysis.
- In feature extraction tasks where maintaining data integrity is crucial.
- As a preprocessing step to improve the performance of subsequent machine learning models.

**Considerations:**
- The effectiveness of a DAE can vary based on the noise type and level; selecting the appropriate model variant is key.
- Training DAEs requires a balance between denoising capability and preserving relevant features, necessitating careful tuning of model parameters.

**Conclusion**

Denoising Autoencoders offer a powerful solution for improving data quality, with their unique training strategy enabling them to extract clean, relevant features from noisy inputs. Their versatility across different data types and robustness to various noise patterns make them an invaluable tool in the data preprocessing pipeline, enhancing the performance of machine learning and deep learning models across a wide range of applications.

In [15]:
class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(DenoisingAutoencoder, self).__init__()

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        # Encoder
        x = torch.relu(self.encoder_fc1(x))

        # Decoder
        x_hat = torch.sigmoid(self.decoder_fc1(x))

        return x_hat

In [16]:
# Instantiate the denoising autoencoder
input_size = 4  # Number of input features
hidden_size = 4  # Number of hidden nodes
dae = DenoisingAutoencoder(input_size, hidden_size)

# Define a sample input (noisy data)
noisy_input = torch.rand((1, input_size))  # Example Noisy Data

# Forward pass to get the reconstructed output
output_dae = dae(noisy_input)

# Print the model architecture and output
print(dae)
print("\nNoisy Input:", noisy_input)
print("Reconstructed Output:", output_dae)

DenoisingAutoencoder(
  (encoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc1): Linear(in_features=4, out_features=4, bias=True)
)

Noisy Input: tensor([[0.8623, 0.3365, 0.8590, 0.2574]])
Reconstructed Output: tensor([[0.3802, 0.5042, 0.6004, 0.3728]], grad_fn=<SigmoidBackward0>)


#### SAE

**High-Level Overview**

Sparse Autoencoders represent a specialized variant of autoencoders, aimed at *unsupervised learning of compressed representations*. By introducing sparsity constraints, they enforce most neurons to be inactive, enhancing feature detection and data representation efficiency. This approach improves generalization, making them suitable for tasks requiring robust feature extraction.

**Data Type**

Sparse Autoencoders efficiently process:
- Images
- Text
- Audio signals
- Continuous numerical data

Their versatility across different data types highlights their utility in feature extraction and data compression tasks.

**Task Objective**

Key applications include:
- Feature extraction and representation learning
- Dimensionality reduction
- Data denoising
- Pretraining for deeper neural networks

Sparsity constraints enable these models to learn higher-level features, distinguishing them from traditional autoencoders.

**Scalability**

Despite sparsity aiding in learning efficient representations, the network's size and depth impact its ability to model complex distributions and computational requirements.

**Robustness to Noise**

They demonstrate significant robustness to noise, attributed to their focus on essential features, making them ideal for denoising and robust representation learning.

**Implementation Variants**

Variants are based on the sparsity enforcement method:
- **KL Divergence Sparse Autoencoder:** Penalizes deviations from a target sparsity level using Kullback-Leibler divergence.
- **L1 Regularization Sparse Autoencoder:** Applies L1 penalty on hidden units' activations to encourage sparsity.
- **Winner-Take-All (WTA) Sparse Autoencoder:** Only a fraction of the most active hidden units are allowed to update their weights, enhancing sparsity.

**Practical Application Guidance**

**When to Use Sparse Autoencoders:**
- In extracting meaningful features from high-dimensional data.
- For dimensionality reduction with interpretability.
- During pretraining phases for deep learning models, providing a good initial weight set that captures useful data patterns.

**Considerations:**
- Selecting the appropriate sparsity constraint and regularization technique is critical for balancing feature selectivity and model complexity.
- Hyperparameters require careful tuning to achieve desired sparsity levels and optimal performance.

**Conclusion**

Sparse Autoencoders stand out for learning efficient and interpretable data representations, with enforced sparsity offering clear advantages in feature selection and model robustness. They are invaluable in preprocessing, feature extraction, and as a pretraining step, enhancing subsequent models' performance across various data types and applications.

In [17]:
class SparseAutoencoder(BaseNN):
    def __init__(self, input_size, hidden_size, sparsity_target=0.1, sparsity_weight=0.2):
        super(SparseAutoencoder, self).__init__(input_size, hidden_size, output_size=input_size)

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)

        # Sparsity parameters
        self.sparsity_target = sparsity_target
        self.sparsity_weight = sparsity_weight
        self.relu = nn.ReLU()

    def forward(self, x):
        # Encoder
        encoded = self.encoder_fc1(x)
        encoded = self.relu(encoded)

        # Decoder
        decoded = torch.sigmoid(self.decoder_fc1(encoded))

        return decoded, encoded

    def loss_function(self, x, x_hat, encoded):
        # Reconstruction loss
        reconstruction_loss = nn.functional.binary_cross_entropy(x_hat, x, reduction='mean')

        # Sparsity loss
        sparsity_loss = torch.sum(self.kl_divergence(self.sparsity_target, encoded))

        # Total loss
        total_loss = reconstruction_loss + self.sparsity_weight * sparsity_loss

        return total_loss

    def kl_divergence(self, target, activations):
        # KL Divergence to enforce sparsity
        p = torch.mean(activations, dim=0)  # Average activation over the dataset
        return target * torch.log(target / p) + (1 - target) * torch.log((1 - target) / (1 - p))

In [18]:
# Instantiate the sparse autoencoder
input_size = 5  # Number of input features
hidden_size = 3  # Number of hidden nodes
sae = SparseAutoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))  # Example Data

# Forward pass to get the reconstructed output and encoded representation
output_sae, encoded_sae = sae(sample_input)

# Calculate the loss
loss_sae = sae.loss_function(sample_input, output_sae, encoded_sae)

# Print the model architecture, output, and loss
print(sae)
print("Input:", sample_input)
print("Reconstructed Output:", output_sae)
print("Encoded Representation:", encoded_sae)
print("Loss:", loss_sae.item())

SparseAutoencoder(
  (encoder_fc1): Linear(in_features=5, out_features=3, bias=True)
  (decoder_fc1): Linear(in_features=3, out_features=5, bias=True)
  (relu): ReLU()
)
Input: tensor([[0.1020, 0.0088, 0.7749, 0.5999, 0.3943]])
Reconstructed Output: tensor([[0.5544, 0.6054, 0.4550, 0.4770, 0.4928]], grad_fn=<SigmoidBackward0>)
Encoded Representation: tensor([[0.0000, 0.0000, 0.3422]], grad_fn=<ReluBackward0>)
Loss: inf


### Deep Convolutional Network (DCN)

**High-Level Overview**

Deep Convolutional Networks (DCNs), also known as Convolutional Neural Networks (CNNs), are a specialized kind of neural network for processing data that has a known grid-like topology. 

Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be seen as a 2D grid of pixels.

DCNs have been instrumental in many advances in computer vision, achieving remarkable success in tasks such as image recognition, object detection, and more.

**Data Type**

Deep Convolutional Networks are particularly adept at handling:
- Image data
- Video sequences
- Time-series data
- Any data that can be represented in a grid-like structure (e.g., sound waves visualized in spectrograms)

Their ability to automatically and adaptively learn spatial hierarchies of features makes them highly effective for tasks involving visual inputs.

**Task Objective**

DCNs excel in a variety of tasks, including but not limited to:
- Image and video recognition
- Image classification
- Object detection
- Semantic segmentation
- Natural language processing (when applied to text data in a convolutional manner)

**Scalability**

One of the key strengths of DCNs is their scalability, not only in terms of handling large volumes of data but also in terms of their capacity to learn from complex and high-dimensional datasets. Their architecture, characterized by layers with convolutions, pooling, and often followed by fully connected layers, allows for efficient training and inference.

**Robustness to Noise**

DCNs are known for their robustness to variations and noise in the input data, making them particularly suitable for real-world applications where data imperfection is common. This robustness stems from their ability to learn invariant features that are critical for recognition tasks.

**Implementation Variants**

There are several popular variants and architectures of DCNs, including:
- **LeNet:** One of the first convolutional networks that demonstrated the effectiveness of convolutional layers.
- **AlexNet:** The network that revitalized interest in convolutional neural networks with its success in the ImageNet challenge.
- **VGGNet:** Known for its simplicity and deep architecture.
- **ResNet:** Introduced residual connections to enable the training of very deep networks.
- **Inception (GoogleNet):** Known for its efficiency and depth with a lower number of parameters.

**Practical Application Guidance**

**When to Use Deep Convolutional Networks:**
- In applications involving image or video processing where capturing spatial hierarchies is crucial.
- For tasks requiring the extraction of complex features from large and high-dimensional datasets.

**Considerations:**
- The design of the network architecture and the choice of hyperparameters are critical for achieving optimal performance.
- Training deep convolutional networks requires substantial computational resources, particularly for large datasets and complex models.

**Conclusion**

Deep Convolutional Networks have revolutionized the field of computer vision and beyond, demonstrating unparalleled success in a wide range of applications involving visual data processing. Their ability to learn powerful representations of data makes them a cornerstone of modern deep learning, with ongoing research pushing the boundaries of what is possible in artificial intelligence.


In [19]:
class DeepCNN(BaseNN):
    def __init__(self, input_channels, num_classes, image_size):
        hidden_size = 64

        super(DeepCNN, self).__init__(input_size=image_size, hidden_size=hidden_size, output_size=num_classes)

        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * (image_size // 4) * (image_size // 4), hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * (self.input_size // 4) * (self.input_size // 4))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [20]:
# Example usage
input_channels = 1  # Grayscale images
num_classes = 10
image_size = 28

deep_cnn_model = DeepCNN(input_channels, num_classes, image_size)

sample_image = torch.rand((1, input_channels, image_size, image_size))

output_scores = deep_cnn_model(sample_image)

print(deep_cnn_model)
print("Input:", sample_image)
print("Output Scores:", output_scores.detach().numpy())

DeepCNN(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=3136, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=10, bias=True)
)
Input: tensor([[[[4.5906e-01, 9.7879e-01, 1.5767e-01, 1.5335e-01, 5.5557e-01,
           8.6285e-01, 6.4839e-01, 5.1157e-01, 3.6890e-01, 6.9478e-01,
           6.8469e-01, 7.8236e-01, 8.4249e-02, 2.5696e-01, 6.6906e-01,
           7.7359e-01, 3.7520e-01, 4.7202e-01, 6.3767e-01, 5.4821e-01,
           6.2599e-01, 5.4250e-02, 7.0396e-01, 6.0149e-01, 5.8098e-01,
           7.4248e-01, 5.5248e-01, 6.0071e-01],
          [2.9531e-01, 5.5693e-01, 4.0573e-01, 3.6962e-01, 5.4100e-02,
           4.2071e-01, 2.9789e-01, 4.8125e-01, 8.6545e-01, 5.7631e-01,
           6.8455e-01, 1.8524e-01, 7.4843e-01, 4.5054e-01, 7.6650e-01,
      

# Recurrent and Memory Models

## Recurrent Neural Network (RNN)

**High-Level Overview**

Recurrent Neural Networks (RNNs) are a class of neural networks that excel at processing sequential data, making them particularly well-suited for tasks involving time-series data, natural language processing, and any scenario where the context or the order of elements is crucial.

RNNs are characterized by their ability to maintain a 'memory' of previous inputs by incorporating loops within the network. This memory allows them to exhibit dynamic temporal behavior and understand sequences, distinguishing them from other neural network architectures that treat inputs independently.

**Data Type**

RNNs are adept at handling:

- Sequential data
- Time-series data
- Text and spoken language
- Any form of data where the sequence order is significant

Their design enables them to process inputs of varying lengths, from short sentences to lengthy documents or extensive time-series datasets.

**Task Objective**

RNNs are particularly useful for:

- Language modeling and text generation
- Speech recognition
- Time-series forecasting
- Sentiment analysis and other forms of text classification

The recurrent nature of RNNs allows them to capture temporal patterns and dependencies, making them a powerful tool for tasks that require an understanding of context within sequences.

**Scalability**

While RNNs theoretically can handle long sequences, they often face challenges with long-term dependencies due to issues like vanishing or exploding gradients. Techniques such as Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs) have been developed to mitigate these issues and enhance RNNs' ability to scale to longer sequences. These two (LSTM & GRU) follow this section.

**Robustness to Noise**

RNNs can be sensitive to noise in the sequence data, especially if the noise affects the temporal dependencies that the network aims to learn. Regularization techniques and careful design of the network architecture are crucial to improve their robustness.

**Implementation Variants**

- **Long Short-Term Memory (LSTM):** Designed to overcome the vanishing gradient problem, allowing RNNs to learn long-term dependencies.
- **Gated Recurrent Units (GRUs):** A simpler alternative to LSTMs that achieves similar performance with fewer parameters.
- **Bidirectional RNNs:** Process the data in both forward and backward directions, providing additional context and improving performance on tasks like speech recognition.

**Practical Application Guidance**

**When to Use Recurrent Neural Networks:**

- When dealing with sequential data where the order and context significantly impact the output.
- For tasks requiring the model to remember and utilize past information over short or long sequences.

**Considerations:**

- Training RNNs can be computationally intensive and may require sophisticated techniques to deal with challenges like vanishing gradients.
- The choice between plain RNNs, LSTMs, and GRUs depends on the specific requirements of the task, including the need for modeling long-term dependencies.

**Conclusion**

Recurrent Neural Networks represent a cornerstone in the field of sequential data analysis, offering the ability to process and generate predictions based on the context of input sequences.

Their unique structure, capable of maintaining a form of internal memory, makes them indispensable for tasks that require understanding the temporal dynamics of data. Despite challenges in training and scalability, advancements like LSTMs and GRUs (seen following this) continue to push the boundaries of what RNNs can achieve, making them a valuable tool in the ever-evolving landscape of neural network architectures.

In [21]:
class SimpleRNN(BaseNN):
    def __init__(self, input_size, hidden_size):
        super(SimpleRNN, self).__init__(input_size, hidden_size, output_size=1)
        self.input_layer = nn.Linear(input_size, hidden_size)
        self.recurrent_layer = nn.RNN(hidden_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, self.output_size)

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        h_t, _ = self.recurrent_layer(x)
        output = torch.sigmoid(self.output_layer(h_t[:, -1, :]))  # Taking the output from the last time step
        return output

In [22]:
# Instantiate RNN Model
simple_rnn_model = SimpleRNN(input_size,hidden_size)

# Forward pass for the simple RNN model
sample_input_rnn = torch.rand((1, 4, input_size))
output_rnn = simple_rnn_model(sample_input_rnn)

print("Simple RNN Model:")
print(simple_rnn_model)
print("Output:", output_rnn.item())

Simple RNN Model:
SimpleRNN(
  (input_layer): Linear(in_features=5, out_features=3, bias=True)
  (recurrent_layer): RNN(3, 3, batch_first=True)
  (output_layer): Linear(in_features=3, out_features=1, bias=True)
)
Output: 0.4832405745983124


# Long Short Term Memory (LSTM)

**High-Level Overview**

Long Short Term Memory networks (LSTMs) are a specialized form of Recurrent Neural Networks (RNNs) designed to address the limitations of traditional RNNs, particularly in learning long-term dependencies. 

LSTMs are equipped with a unique architecture that includes memory cells and multiple gates (input, output, and forget gates), enabling them to maintain information over extended sequences and effectively manage the flow of information.

**Data Type**

LSTMs are versatile and can process a wide range of sequential data, including:

- Textual data for natural language processing tasks.
- Time-series data for forecasting in finance, weather, and more.
- Sequential inputs from sensors for activity recognition or medical diagnosis.
- Audio signals for speech recognition and music generation.

Their ability to capture temporal dependencies makes them ideal for applications where the sequence and context of the data matter.

**Task Objective**

LSTMs excel in tasks that require understanding and remembering context over long sequences, such as:

- Language modeling and text generation.
- Sequence to sequence translation, e.g., language translation.
- Speech recognition and synthesis.
- Complex time-series prediction and anomaly detection.

**Scalability**

The sophisticated gating mechanisms of LSTMs allow them to scale well to longer sequences, addressing the vanishing gradient problem common in simpler RNNs. However, this complexity can lead to higher computational costs during training and inference, making efficiency and optimization key considerations for large-scale applications.

**Robustness to Noise**

Thanks to their ability to selectively remember and forget information, LSTMs demonstrate robustness to noisy and irrelevant inputs, making them suitable for real-world applications where data quality can vary.

**Implementation Variants**

Several variants and improvements on the original LSTM architecture have been proposed to enhance performance and efficiency, including:

- **Bidirectional LSTMs (BiLSTMs):** Process data in both forward and backward directions, improving context understanding.
- **Gated Recurrent Units (GRUs):** A simplified version of LSTMs that combines the input and forget gates into a single update gate.
- **Peephole LSTMs:** Allow the gates to access the cell state directly, enhancing the control over the memory cell.

**Practical Application Guidance**

**When to Use LSTMs:**

- For tasks involving sequences where the context and the temporal order of data points are crucial for making accurate predictions or decisions.
- In scenarios where learning long-term dependencies is essential for performance.

**Considerations:**

- While powerful, LSTMs can be more challenging to train and fine-tune due to their complexity and the larger number of parameters compared to simpler models.
- Careful design of the network architecture and selection of hyperparameters are essential to balance performance with computational efficiency.

**Conclusion**

Long Short Term Memory networks have revolutionized the handling of sequential data by enabling models to learn and remember over long sequences. Their design mitigates the challenges associated with traditional RNNs, making them a cornerstone for a wide array of applications in natural language processing, time-series analysis, and beyond. As research continues, LSTMs remain a critical tool in the deep learning toolkit, driving advancements in understanding sequential data.

In [23]:
# LSTM Neural Network
class LSTMNN(BaseNN):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMNN, self).__init__(input_size, hidden_size, output_size)
        self.lstm_layer = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        _, (h_t, c_t) = self.lstm_layer(x)
        output = torch.sigmoid(self.output_layer(h_t[-1, :, :]))  # Taking the output from the last time step
        return output

In [75]:
# Instantiate the LSTM neural network
input_size = 3  # Number of input features 
hidden_size = 3  # Number of memory cells
output_size = 4  # Number of output nodes
lstm_model = LSTMNN(input_size, hidden_size, output_size)

# Define a sample input
sample_input = torch.rand((1, 4, input_size))  # Example Data

# Forward pass to get the output
output_lstm = lstm_model(sample_input)

# Print the model architecture and output
print(lstm_model)
print("Input:", sample_input)
print("Output:", output_lstm)

LSTMNN(
  (lstm_layer): LSTM(3, 3, batch_first=True)
  (output_layer): Linear(in_features=3, out_features=4, bias=True)
)
Input: tensor([[[0.2538, 0.5605, 0.0580],
         [0.7429, 0.5649, 0.5510],
         [0.3359, 0.9129, 0.1333],
         [0.4529, 0.0286, 0.2647]]])
Output: tensor([[0.4000, 0.5383, 0.6445, 0.6661]], grad_fn=<SigmoidBackward0>)


# Gated Recurrent Unit (GRU)

In [25]:
class GRUNN(BaseNN):
    def __init__(self, input_size, hidden_size, output_size=1):
        super(GRUNN, self).__init__(input_size, hidden_size, output_size)
        self.gru_layer = nn.GRU(input_size, hidden_size, batch_first=True)
        self.output_layer = nn.Linear(hidden_size, self.output_size)

    def forward(self, x):
        h_t, _ = self.gru_layer(x)
        output = torch.sigmoid(self.output_layer(h_t[:, -1, :]))  # Taking the output from the last time step
        return output

In [76]:
# Instantiate the GRU neural network
input_size = 3  # Number of input features 
hidden_size = 3  # Number of memory cells
output_size = 4  # Number of output nodes
gru_model = GRUNN(input_size, hidden_size, output_size)

# Define a sample input
sample_input = torch.rand((1, 4, input_size))  # Example Data

# Forward pass to get the output
output_gru = gru_model(sample_input)

# Print the model architecture and output
print(gru_model)
print("Input:", sample_input)
print("Output:", output_gru)

GRUNN(
  (gru_layer): GRU(3, 3, batch_first=True)
  (output_layer): Linear(in_features=3, out_features=4, bias=True)
)
Input: tensor([[[0.1642, 0.7892, 0.1311],
         [0.1254, 0.6721, 0.6972],
         [0.3529, 0.5542, 0.5755],
         [0.7039, 0.1032, 0.5056]]])
Output: tensor([[0.4682, 0.5686, 0.4778, 0.5822]], grad_fn=<SigmoidBackward0>)


# Auto Encoder (AE)

AE designed for unsupervised learning & Data compression.

Learns compact representation of input data.

Used for data denoising, dimensionality reduction, feature learning.

Versitile building block in utilizing NN's.

In [27]:
class Autoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Linear(input_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        encoded = torch.relu(self.encoder(x))
        decoded = torch.sigmoid(self.decoder(encoded))
        return decoded

In [28]:
# Instantiate the autoencoder
input_size = 10  # Number of input features
hidden_size = 5  # Number of hidden nodes (compressed representation)
autoencoder = Autoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))

# Forward pass to get the reconstructed output
output_autoencoder = autoencoder(sample_input)

# Print the model architecture and output
print(autoencoder)
print("Output:", output_autoencoder)

Autoencoder(
  (encoder): Linear(in_features=10, out_features=5, bias=True)
  (decoder): Linear(in_features=5, out_features=10, bias=True)
)
Output: tensor([[0.4638, 0.5080, 0.6151, 0.4351, 0.5180, 0.5389, 0.4567, 0.5438, 0.5156,
         0.3931]], grad_fn=<SigmoidBackward0>)


# Variational AE (VAE)

**High-Level Overview**

Variational Autoencoders (VAEs) are a cornerstone in the field of generative AI, representing a powerful class of deep learning models for generative modeling. They are designed to learn the underlying probability distribution of training data, enabling the generation of new data points with similar properties. VAEs combine traditional autoencoder architecture with variational inference principles, allowing them to compress data into a latent space and then generate data by sampling from this space, thereby facilitating a deep exploration of the continuous latent space representing the data.

**Data Type**

VAEs demonstrate remarkable adaptability across a range of data types, including:
- Images
- Text
- Audio
- Continuous numerical data

This versatility underscores their prominence in generative AI, making them a popular choice for a wide array of generative tasks.

**Task Objective**

Emphasizing their role in generative AI, VAEs excel in:
- Data generation
- Feature extraction and representation learning
- Dimensionality reduction
- Anomaly detection

Their deep learning capabilities enable them not only to model complex distributions but also to generate new, coherent samples, showcasing the transformative potential of generative AI.

**Scalability**

With their deep neural network architecture, VAEs scale effectively to accommodate the complexity and volume of vast datasets, further solidifying their status in generative AI for handling high-dimensional data efficiently.

**Robustness to Noise**

VAEs' proficiency in denoising and reconstructing inputs highlights their robustness, making them invaluable for applications in generative AI where data cleanliness cannot be assured.

**Implementation Variants**

Reflecting the innovation in generative AI, various VAE models have been developed to address specific challenges or improve upon the original framework, including Conditional VAEs, Beta-VAEs, and Disentangled VAEs, each offering unique advantages for controlled data generation and enhanced interpretability of latent representations.

**Practical Application Guidance**

In the realm of generative AI, VAEs are particularly suited for:
- Generating new data that mimics the properties of specific datasets.
- Unsupervised learning of complex data distributions.
- Applications requiring a nuanced understanding of data's underlying structure.

**Considerations:**

Training VAEs can present challenges, such as mode collapse, underscoring the need for expertise in generative AI to navigate these complexities successfully.

### Conclusion

Variational Autoencoders (VAEs) have cemented their place as a fundamental technology in generative AI, offering a sophisticated mechanism for understanding and generating data. Their broad applicability and the depth of insight they provide into data's inherent structure make them a pivotal tool in the advancement of generative modeling.

In [29]:
class VariationalAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(VariationalAutoencoder, self).__init__()

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)
        self.encoder_fc2_mean = nn.Linear(hidden_size, hidden_size)
        self.encoder_fc2_logvar = nn.Linear(hidden_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)
        self.decoder_fc2 = nn.Linear(input_size, input_size)

    def reparameterize(self, mean, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mean + eps * std

    def forward(self, x):
        # Encoder
        x = torch.relu(self.encoder_fc1(x))
        mean = self.encoder_fc2_mean(x)
        logvar = self.encoder_fc2_logvar(x)

        # Reparameterization trick
        z = self.reparameterize(mean, logvar)

        # Decoder
        x_hat = torch.relu(self.decoder_fc1(z))
        x_hat = torch.sigmoid(self.decoder_fc2(x_hat))

        return x_hat, mean, logvar

In [30]:
# Instantiate the variational autoencoder
input_size = 4  # Number of input features
hidden_size = 4  # Number of hidden nodes in probabilistic layer
vae = VariationalAutoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))

# Forward pass to get the reconstructed output and latent variables
output_vae, mean, logvar = vae(sample_input)

# Print the model architecture and output
print(vae)
print("Output:", output_vae)
print("Mean:", mean)
print("Log Variance:", logvar)

VariationalAutoencoder(
  (encoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (encoder_fc2_mean): Linear(in_features=4, out_features=4, bias=True)
  (encoder_fc2_logvar): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc2): Linear(in_features=4, out_features=4, bias=True)
)
Output: tensor([[0.4524, 0.5483, 0.4470, 0.4638]], grad_fn=<SigmoidBackward0>)
Mean: tensor([[-0.2343, -0.4617, -0.1976,  0.0995]], grad_fn=<AddmmBackward0>)
Log Variance: tensor([[-0.3335,  0.1722, -0.4423,  0.1874]], grad_fn=<AddmmBackward0>)


# Denoising Auto Encoder

**High-Level Overview**

Denoising Autoencoders (DAEs) are an advanced type of autoencoder designed to *remove noise from input data*. By intentionally corrupting the input data and then learning to reconstruct the original, uncorrupted data, DAEs are trained to capture the most relevant features. This process enhances the model's ability to generalize from the data, making it highly effective for tasks that require robust feature extraction and data denoising capabilities.

**Data Type**

Denoising Autoencoders are capable of processing various data types, including:
- Images
- Text
- Audio signals
- Continuous numerical data

Their adaptability makes them particularly useful for applications involving noisy or incomplete data.

**Task Objective**

Denoising Autoencoders are primarily used for:
- Data denoising
- Feature extraction and representation learning
- Dimensionality reduction
- Data generation and enhancement

By learning to ignore the "noise" in data, DAEs excel in recovering clean representations from corrupted inputs.

**Scalability**

Similar to other autoencoders, the scalability of DAEs depends on the network architecture. Modern techniques and computational resources allow DAEs to handle large datasets and complex noise patterns effectively, showcasing their scalability in practical applications.

**Robustness to Noise**

The core strength of DAEs lies in their robustness to noise. They are specifically trained to identify and ignore irrelevant features (noise), focusing on reconstructing the essential aspects of the data, which makes them exceptionally reliable for denoising tasks.

**Implementation Variants**

Several variants of DAEs have been developed to address different types of noise or to enhance specific aspects of denoising, including:
- **Gaussian Noise DAEs:** Target Gaussian noise in the data.
- **Salt-and-Pepper Noise DAEs:** Designed to remove binary noise from images.
- **Variational DAEs:** Combine denoising capabilities with variational autoencoder frameworks for improved generative properties.

**Practical Application Guidance**

**When to Use Denoising Autoencoders:**
- For cleaning noisy data before further processing or analysis.
- In feature extraction tasks where maintaining data integrity is crucial.
- As a preprocessing step to improve the performance of subsequent machine learning models.

**Considerations:**
- The effectiveness of a DAE can vary based on the noise type and level; selecting the appropriate model variant is key.
- Training DAEs requires a balance between denoising capability and preserving relevant features, necessitating careful tuning of model parameters.

### Conclusion

Denoising Autoencoders offer a powerful solution for improving data quality, with their unique training strategy enabling them to extract clean, relevant features from noisy inputs. Their versatility across different data types and robustness to various noise patterns make them an invaluable tool in the data preprocessing pipeline, enhancing the performance of machine learning and deep learning models across a wide range of applications.

In [31]:
class DenoisingAutoencoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(DenoisingAutoencoder, self).__init__()

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        # Encoder
        x = torch.relu(self.encoder_fc1(x))

        # Decoder
        x_hat = torch.sigmoid(self.decoder_fc1(x))

        return x_hat

In [32]:
# Instantiate the denoising autoencoder
input_size = 4  # Number of input features
hidden_size = 4  # Number of hidden nodes
dae = DenoisingAutoencoder(input_size, hidden_size)

# Define a sample input (noisy data)
noisy_input = torch.rand((1, input_size))  # Example Noisy Data

# Forward pass to get the reconstructed output
output_dae = dae(noisy_input)

# Print the model architecture and output
print(dae)
print("Noisy Input:", noisy_input)
print("Reconstructed Output:", output_dae)

DenoisingAutoencoder(
  (encoder_fc1): Linear(in_features=4, out_features=4, bias=True)
  (decoder_fc1): Linear(in_features=4, out_features=4, bias=True)
)
Noisy Input: tensor([[0.4459, 0.3527, 0.6623, 0.0946]])
Reconstructed Output: tensor([[0.5092, 0.5703, 0.4878, 0.3946]], grad_fn=<SigmoidBackward0>)


# Sparse Auto Encoder

**High-Level Overview**

Sparse Autoencoders represent a specialized variant of autoencoders, aimed at *unsupervised learning of compressed representations*. By introducing sparsity constraints, they enforce most neurons to be inactive, enhancing feature detection and data representation efficiency. This approach improves generalization, making them suitable for tasks requiring robust feature extraction.

**Data Type**

Sparse Autoencoders efficiently process:
- Images
- Text
- Audio signals
- Continuous numerical data

Their versatility across different data types highlights their utility in feature extraction and data compression tasks.

**Task Objective**

Key applications include:
- Feature extraction and representation learning
- Dimensionality reduction
- Data denoising
- Pretraining for deeper neural networks

Sparsity constraints enable these models to learn higher-level features, distinguishing them from traditional autoencoders.

**Scalability**

Despite sparsity aiding in learning efficient representations, the network's size and depth impact its ability to model complex distributions and computational requirements.

**Robustness to Noise**

They demonstrate significant robustness to noise, attributed to their focus on essential features, making them ideal for denoising and robust representation learning.

**Implementation Variants**

Variants are based on the sparsity enforcement method:
- **KL Divergence Sparse Autoencoder:** Penalizes deviations from a target sparsity level using Kullback-Leibler divergence.
- **L1 Regularization Sparse Autoencoder:** Applies L1 penalty on hidden units' activations to encourage sparsity.
- **Winner-Take-All (WTA) Sparse Autoencoder:** Only a fraction of the most active hidden units are allowed to update their weights, enhancing sparsity.

**Practical Application Guidance**

**When to Use Sparse Autoencoders:**
- In extracting meaningful features from high-dimensional data.
- For dimensionality reduction with interpretability.
- During pretraining phases for deep learning models, providing a good initial weight set that captures useful data patterns.

**Considerations:**
- Selecting the appropriate sparsity constraint and regularization technique is critical for balancing feature selectivity and model complexity.
- Hyperparameters require careful tuning to achieve desired sparsity levels and optimal performance.

### Conclusion

Sparse Autoencoders stand out for learning efficient and interpretable data representations, with enforced sparsity offering clear advantages in feature selection and model robustness. They are invaluable in preprocessing, feature extraction, and as a pretraining step, enhancing subsequent models' performance across various data types and applications.

In [33]:
class SparseAutoencoder(BaseNN):
    def __init__(self, input_size, hidden_size, sparsity_target=0.1, sparsity_weight=0.2):
        super(SparseAutoencoder, self).__init__(input_size, hidden_size, output_size=input_size)

        # Encoder layers
        self.encoder_fc1 = nn.Linear(input_size, hidden_size)

        # Decoder layers
        self.decoder_fc1 = nn.Linear(hidden_size, input_size)

        # Sparsity parameters
        self.sparsity_target = sparsity_target
        self.sparsity_weight = sparsity_weight
        self.relu = nn.ReLU()

    def forward(self, x):
        # Encoder
        encoded = self.encoder_fc1(x)
        encoded = self.relu(encoded)

        # Decoder
        decoded = torch.sigmoid(self.decoder_fc1(encoded))

        return decoded, encoded

    def loss_function(self, x, x_hat, encoded):
        # Reconstruction loss
        reconstruction_loss = nn.functional.binary_cross_entropy(x_hat, x, reduction='mean')

        # Sparsity loss
        sparsity_loss = torch.sum(self.kl_divergence(self.sparsity_target, encoded))

        # Total loss
        total_loss = reconstruction_loss + self.sparsity_weight * sparsity_loss

        return total_loss

    def kl_divergence(self, target, activations):
        # KL Divergence to enforce sparsity
        p = torch.mean(activations, dim=0)  # Average activation over the dataset
        return target * torch.log(target / p) + (1 - target) * torch.log((1 - target) / (1 - p))

In [34]:
# Instantiate the sparse autoencoder
input_size = 5  # Number of input features
hidden_size = 3  # Number of hidden nodes
sae = SparseAutoencoder(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))  # Example Data

# Forward pass to get the reconstructed output and encoded representation
output_sae, encoded_sae = sae(sample_input)

# Calculate the loss
loss_sae = sae.loss_function(sample_input, output_sae, encoded_sae)

# Print the model architecture, output, and loss
print(sae)
print("Input:", sample_input)
print("Reconstructed Output:", output_sae)
print("Encoded Representation:", encoded_sae)
print("Loss:", loss_sae.item())

SparseAutoencoder(
  (encoder_fc1): Linear(in_features=5, out_features=3, bias=True)
  (decoder_fc1): Linear(in_features=3, out_features=5, bias=True)
  (relu): ReLU()
)
Input: tensor([[0.9295, 0.4682, 0.1434, 0.8834, 0.6853]])
Reconstructed Output: tensor([[0.3732, 0.6395, 0.5619, 0.5443, 0.6051]], grad_fn=<SigmoidBackward0>)
Encoded Representation: tensor([[0.0000, 0.3336, 0.0000]], grad_fn=<ReluBackward0>)
Loss: inf


# Markov Chain (MC)

**High-Level Overview**

Markov Chains represent a stochastic model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. This mathematical framework is fundamental in the study of random processes and is widely applicable across various domains, including statistical mechanics, economics, and predictive modeling. Markov Chains are particularly valued for their simplicity and power in modeling the randomness of systems evolving over time.

**Data Type**

Markov Chains are applicable to:
- Discrete events or states
- Temporal or spatial sequences

Their adaptability allows them to model a wide array of processes, from simple random walks to complex decision-making scenarios.

**Task Objective**

Markov Chains excel in:
- Predicting state transitions
- Modeling random processes
- Decision making under uncertainty

Their predictive capabilities make them an essential tool for scenarios where future states depend on the current state, without the need for historical data.

**Scalability**

Markov Chains scale well with the complexity of the model, primarily influenced by the number of states. While larger state spaces increase computational demands, advancements in algorithms and computing power have made it feasible to tackle complex chains efficiently.

**Robustness to Noise**

Given their probabilistic nature, Markov Chains naturally incorporate and manage uncertainty and noise within their models. This robustness makes them suitable for applications where data may be incomplete or inherently random.

**Implementation Variants**

Markov Chains come in various forms, including:
- **Discrete-Time Markov Chains:** Model transitions in discrete time steps.
- **Continuous-Time Markov Chains:** Allow for transitions at any point in time.
- **Hidden Markov Models (HMMs):** Extend Markov Chains by allowing observations to be a probabilistic function of the state, useful in scenarios where states are not directly observable.

**Practical Application Guidance**

**When to Use Markov Chains:**
- For modeling sequential or temporal data where future states depend on the current state.
- In decision-making processes to evaluate different strategies under uncertainty.
- When analyzing systems or processes that evolve over time in predictable patterns.

**Considerations:**
- Markov Chains assume the future is independent of the past given the present state, which may not hold in systems with memory or where historical context is crucial.
- They are best applied to processes where this assumption of memorylessness (the Markov property) is reasonable or where state transitions are primarily influenced by the current state.

### Conclusion

Markov Chains offer a powerful and flexible framework for modeling random processes and making predictions based on state transitions. By understanding their structure, capabilities, and the variety of their applications, one can effectively leverage Markov Chains to gain insights into complex systems, predict future events, and make informed decisions under uncertainty.

In [35]:
class MarkovChainNN(BaseNN):
    def __init__(self, input_size, hidden_size):
        super(MarkovChainNN, self).__init__(input_size, hidden_size, output_size=input_size)
        self.transition_matrix = nn.Parameter(torch.randn(input_size, hidden_size))
        self.output_layer = nn.Linear(hidden_size, input_size)

    def forward(self, x):
        # Apply a simple linear transformation based on the transition matrix
        x = torch.matmul(x, self.transition_matrix)
        # Apply a linear layer to get the final output
        output = self.output_layer(x)
        # You might want to apply some non-linearity here based on your specific needs
        # For example, you can use torch.relu(output) or torch.sigmoid(output) depending on the task
        return output

# Instantiate the Markov Chain neural network
input_size = 4  # Number of input features 
hidden_size = 8  # Number of hidden states
markov_chain_model = MarkovChainNN(input_size, hidden_size)

# Define a sample input
sample_input = torch.rand((1, input_size))  # Example Data

# Forward pass to get the output
output_markov_chain = markov_chain_model(sample_input)

In [36]:
# Print the model architecture and output
print(markov_chain_model)
print("Output:", output_markov_chain)

MarkovChainNN(
  (output_layer): Linear(in_features=8, out_features=4, bias=True)
)
Output: tensor([[-0.2257,  0.1200, -0.0358,  0.9254]], grad_fn=<AddmmBackward0>)


# Hopfield Network

**High-Level Overview**

Hopfield Networks are a form of recurrent neural network with a unique structure that allows them to serve as associative memory systems. These networks are characterized by fully connected neurons with symmetric weight matrices, enabling them to converge to stable states or "memories". This architecture makes Hopfield Networks particularly adept at solving optimization and memory recall tasks, leveraging their ability to find energy minima to recall stored patterns.

**Data Type**

Hopfield Networks primarily deal with:
- Binary data
- Bipolar data

Their structure is optimized for patterns represented in these formats, making them suitable for tasks that can be encoded as binary or bipolar vectors.

**Task Objective**

Hopfield Networks are well-suited for:
- Pattern recognition
- Associative memory recall
- Optimization problems

Their ability to serve as content-addressable ("associative") memory systems allows them to recall entire patterns based on partial or noisy inputs, showcasing their strength in tasks requiring robust pattern completion and error correction.

**Scalability**

While Hopfield Networks provide powerful capabilities for pattern recognition and memory recall, their scalability is limited by the network size due to the fully connected nature of the architecture. The capacity of a Hopfield Network to store memories without error is approximately 15% of the number of neurons, limiting the size of problems they can effectively solve without modifications or extensions.

**Robustness to Noise**

A key feature of Hopfield Networks is their robustness to noise in input patterns. They can recover original stored patterns from inputs that are partially incorrect or incomplete, making them highly effective for tasks requiring error tolerance and noise reduction in pattern recall.

**Implementation Variants**

To address scalability and efficiency, several variants of Hopfield Networks have been developed, including:
- **Continuous Hopfield Networks:** Extend the binary model to continuous values, allowing for application to a wider range of problems.
- **Stochastic Hopfield Networks:** Introduce randomness in the update rules, enhancing the network's ability to escape local minima and find better solutions for optimization problems.

**Practical Application Guidance**

**When to Use Hopfield Networks:**
- When the task involves recovering or completing partial patterns.
- For optimization problems where potential solutions can be encoded as binary or bipolar vectors.
- In applications where associative memory models offer a natural solution.

**Considerations:**
- Hopfield Networks are not well-suited for large-scale problems due to their limited storage capacity and the computational cost of fully connected networks.
- They may not be the best choice for new tasks with high-dimensional data or where deep learning approaches have demonstrated superior performance.

### Conclusion

Hopfield Networks offer a approach to associative memory and optimization problems, with their unique ability to recall stored patterns from noisy or incomplete inputs. Understanding their structure, capabilities, and limitations is crucial for leveraging their strengths in relevant applications, while recognizing when alternative neural network models might be more appropriate.

In [37]:
class HopfieldNetwork(BaseNN):
    def __init__(self, input_size):
        super(HopfieldNetwork, self).__init__(input_size, hidden_size=None, output_size=input_size)
        
        # Weight matrix for the Hopfield Network
        self.weights = nn.Parameter(torch.zeros((input_size, input_size), dtype=torch.float))

    def forward(self, x):
        # Apply the Hopfield Network dynamics
        y = torch.sign(x @ self.weights).long()  # Convert to torch.long after applying sign
        return y


In [38]:
# Example usage
input_size = 5  # Change this based on your needs
hopfield_model = HopfieldNetwork(input_size)

# Define a sample input pattern (1 or -1)
sample_input = torch.tensor([[1, -1, 1, -1, 1]], dtype=torch.float)  # Change the data type to torch.float

# Forward pass to retrieve the output
output_hopfield = hopfield_model(sample_input)

# Print the model architecture and output
print(hopfield_model)
print("Output:", output_hopfield.numpy())

HopfieldNetwork()
Output: [[0 0 0 0 0]]


# Boltzmann Machine (BM)

BM is a stochastic RNN designed to find a probability distribution over its set of binary-valued patterns. 

*Main Objective* is to learn the joint probablity distribution of its training data.

Unique Features:
-visible & hidden units forming bipartite graph
-connects between units have weights & model learns weights during training
-stochastic update process for both training & inference

Common Uses:
-unsupervised learning tasks like feature learning, dimensionality reduction, density estimation.

In [39]:
class BoltzmannMachine(nn.Module):
    def __init__(self, num_visible, num_hidden):
        super(BoltzmannMachine, self).__init__()
        self.num_visible = num_visible
        self.num_hidden = num_hidden

        # Define the parameters (weights and biases)
        self.weights = nn.Parameter(torch.randn(num_visible, num_hidden))
        self.visible_bias = nn.Parameter(torch.randn(num_visible))
        self.hidden_bias = nn.Parameter(torch.randn(num_hidden))

    def forward(self, visible_states):
        # Ensure visible_states has the correct dimensions (batch_size x num_visible)
        if visible_states.dim() == 1:
            visible_states = visible_states.view(1, -1)

        # Compute the hidden probabilities given visible states
        hidden_probabilities = F.sigmoid(F.linear(visible_states, self.weights.t(), self.hidden_bias))

        # Sample hidden states from the computed probabilities
        hidden_states = torch.bernoulli(hidden_probabilities)

        # Compute the visible probabilities given the sampled hidden states
        visible_probabilities = F.sigmoid(F.linear(hidden_states, self.weights, self.visible_bias))

        # Sample visible states from the computed probabilities
        visible_states = torch.bernoulli(visible_probabilities)

        return visible_states, hidden_states

In [40]:
# Example usage
num_visible = 5
num_hidden = 3

boltzmann_machine = BoltzmannMachine(num_visible, num_hidden)

# Define a sample visible state (binary values)
sample_visible_state = torch.tensor([1, 0, 1, 0, 1.], dtype=torch.float)

# Perform a Gibbs sampling step
sampled_visible, sampled_hidden = boltzmann_machine(sample_visible_state)

# Print the model architecture and sampled states
print(boltzmann_machine)
print("Sampled Visible State:", sampled_visible.detach().numpy())
print("Sampled Hidden State:", sampled_hidden.detach().numpy())

BoltzmannMachine()
Sampled Visible State: [[0. 1. 1. 1. 1.]]
Sampled Hidden State: [[1. 0. 0.]]


# Restricted Boltzmann Machine (RBM)

Differences from normal Boltzmann Machine:

    -No connections between units within same layer (no hidden-hidden or visible-visible connections)
    
    -Bipartite graph with one layer of visible units and one layer of hidden units
    
Objective: 

    -RBM objective with changes for potentially more effective feature learning

Unique Features:

    -Widely used for feature learning & are building blocks in deep learning architectures.
    
    -Efficent training, Contrastive Divergence (CD) is often used for training RBMs
Common Uses:

    -Feature Learning; pre-training deep NN
    
    -Collaborative filtering, topic modeling, other unsupervised learning tasks.

In [41]:
class RBM(BaseNN):
    def __init__(self, visible_size, hidden_size):
        super(RBM, self).__init__(visible_size, hidden_size, None)
        self.weights = nn.Parameter(torch.randn(visible_size, hidden_size))
        self.visible_bias = nn.Parameter(torch.zeros(visible_size))
        self.hidden_bias = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x):
        hidden_prob = F.sigmoid(F.linear(x, self.weights.t(), self.hidden_bias))
        hidden_state = torch.bernoulli(hidden_prob)
        reconstructed_prob = F.sigmoid(F.linear(hidden_state, self.weights, self.visible_bias))
        return hidden_state, reconstructed_prob

In [42]:
# Example usage
visible_size = 5
hidden_size = 3

# Create an RBM model
rbm_model = RBM(visible_size, hidden_size)

# Define a sample visible state (binary values)
sample_visible_state = torch.tensor([[1, 0, 1, 0, 1.]], dtype=torch.float)

# Forward pass to get the hidden states
hidden_states, _ = rbm_model(sample_visible_state)

# Print the model architecture and hidden states
print(rbm_model)
print("Sampled Visible State:", sample_visible_state.detach().numpy())
print("Hidden States:", hidden_states.detach().numpy())

RBM()
Sampled Visible State: [[1. 0. 1. 0. 1.]]
Hidden States: [[1. 1. 0.]]


# Deep Belief Network (DBN)

Objective:

    -Unsupervised learning & feature learning
    -Model complex hierarchical representations of data
Unique Features:
    
    -Multiple layer of stochastic, latent variables (usually binary)
    -Stack of Restricted Boltzmann Machines (See super in the DBN class init for implementing via making DBN layers)
    - Uses a layer-wise pre-training approach followed by fine-tuning using backprop
    
Common Uses:
    
    - Feature Learning
    - Generative tasks (new samples from learned distribution)

In [43]:
class DBN(BaseNN):
    def __init__(self, visible_size, hidden_sizes):
        super(DBN, self).__init__(visible_size, None, None)
        self.rbm_layers = nn.ModuleList([RBM(visible_size, hidden_size) for hidden_size in hidden_sizes])

    def forward(self, x):
        # Forward pass through each RBM layer
        for rbm_layer in self.rbm_layers:
            x, _ = rbm_layer(x)
        return x

In [44]:
# Example usage
visible_size = 5
hidden_sizes = [5, 1]

# Create a Deep Belief Network
dbn_model = DBN(visible_size, hidden_sizes)

# Define a sample input
sample_input = torch.rand((1, visible_size))

# Forward pass through the DBN
output_dbn = dbn_model(sample_input)

# Print the model architecture and output
print(dbn_model)
print("Output:", output_dbn.detach().numpy())

DBN(
  (rbm_layers): ModuleList(
    (0-1): 2 x RBM()
  )
)
Output: [[0.]]


# Deep Convolutional Network (DCN)

Objective:
    
    -Processing structured grid data like images
    -Excels @ capturing hierarchical spatial patterns

Unique Features:
    
    -Uses convolutional layers w/ learnable filters that capture local patterns
    -Typically includes pooling layers to reduce spatial dimensions & increase computational efficency
    -Uses shared weights in convolutional layers for translation invariance
    
Common uses:
    
    -Image classification
    -Feature learning in spatial data (hierarchical representations of spatial features)
    -Transfer learning (pre-trained CNNs on large datasets often fine-tuned for specific tasks)

In [45]:
class DeepCNN(BaseNN):
    def __init__(self, input_channels, num_classes, image_size):
        hidden_size = 64

        super(DeepCNN, self).__init__(input_size=image_size, hidden_size=hidden_size, output_size=num_classes)

        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * (image_size // 4) * (image_size // 4), hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 64 * (self.input_size // 4) * (self.input_size // 4))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [46]:
# Example usage
input_channels = 1  # Grayscale images
num_classes = 10
image_size = 28

deep_cnn_model = DeepCNN(input_channels, num_classes, image_size)

sample_image = torch.rand((1, input_channels, image_size, image_size))

output_scores = deep_cnn_model(sample_image)

print(deep_cnn_model)
print("Output Scores:", output_scores.detach().numpy())

DeepCNN(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=3136, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=10, bias=True)
)
Output Scores: [[-0.06619537 -0.02664012 -0.06680252 -0.03263853 -0.06080712  0.02075643
  -0.01576368 -0.11744033 -0.03575525  0.09718715]]


# Deconvolutional Network (DN)

Objective:
    
    -Reconstruction & generation of struuctured data (especially images)
    -specialize in capturing hierarchical spatial patterns

Unique Features:
    
    -deconvolutional layers w/ learnable filters for reconstructing spatial patterns
    -usually include unpooling layers to increase spatial dimensions while maintaining computational efficency
    -shared weights in deconvolutional layers to introduce translation invariance during reconstruction
    
Common uses:
    
    -image reconstruction & generation
    -feature learning in spatial data w/ focus on capturing hierarchical spatial patterns
    -semantic segmentation in images
    -inverse problems (ex.image restoration)
    -transfer learning (pre-trained on larger dataset -> fine-tuned for specicific reconstruction task)

In [47]:
class DeepDeconvNet(BaseNN):
    def __init__(self, input_channels, output_channels, output_size):
        hidden_size = 64

        super(DeepDeconvNet, self).__init__(input_size=None, hidden_size=hidden_size, output_size=output_size)

        self.fc1 = nn.Linear(hidden_size, 64 * (output_size // 4) * (output_size // 4))
        self.deconv1 = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=3, stride=2, padding=1, output_padding=1)
        self.deconv2 = nn.ConvTranspose2d(in_channels=32, out_channels=output_channels, kernel_size=3, stride=2, padding=1, output_padding=1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = x.view(-1, 64, (self.output_size // 4), (self.output_size // 4))
        x = F.relu(self.deconv1(x))
        x = F.sigmoid(self.deconv2(x))
        return x

notes on parameters above: 

    channels: in context of images, 1=grayscale, 3=RGB
    kernel_size: size of convolutional kernel-filter, size of local region considered for each convolutional operation
    stride: step-size
    padding: zero-padding addied to input of each side, helps maintain/adjust spatial dimensions
    output_size: shape of output data

In [48]:
# Example usage
input_channels = 1  # Grayscale images
output_channels = 3  # Number of channels in the output image (e.g., RGB)
output_size = 28

deep_deconv_model = DeepDeconvNet(input_channels, output_channels, output_size)

sample_latent_vector = torch.rand((1, deep_deconv_model.hidden_size))

output_image = deep_deconv_model(sample_latent_vector)

print(deep_deconv_model)
print("Output Image Shape:", output_image.shape)

DeepDeconvNet(
  (fc1): Linear(in_features=64, out_features=3136, bias=True)
  (deconv1): ConvTranspose2d(64, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
  (deconv2): ConvTranspose2d(32, 3, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
)
Output Image Shape: torch.Size([1, 3, 28, 28])


# Deep Convolutional Inverse Graphics Network (DCIGN)

Objective:

    -Inverting image rendering process to understand & reconstruct 3D structure from 2D images
    -Specializes network for specific tasks involving 3D object manipulation & scene understanding 
Unique Features:

    -Combines convolutional layers for feature extraction w/ inverse graphics layers for 3D reconstruction
    -Capable of learning interpretable, manipulable representations of image elements
    -Adaptable architecture for varying levels of detail and types of 3D reconstruction
    
Common uses:
   
    -3D object reconstruction from 2D images in computer vision and graphics
    -Pose estimation for objects and characters in images
    -comprehensive scene understanding for robotics & autonomous navigation
    -applications in AR & VR for real-time image manipulation
    -Image restoration & completion

In [49]:
class DCIGN(BaseNN):
    def __init__(self, input_channels, input_size, output_size):
        hidden_size = 64 

        super(DCIGN, self).__init__(input_size=input_size, hidden_size=hidden_size, output_size=output_size)

        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Calculate the size of the flattened output after convolutional and pooling layers
        self.flattened_size = 64 * (input_size // 4) * (input_size // 4)

        self.fc1 = nn.Linear(self.flattened_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))

        # Flatten the output for the fully connected layers
        x = x.view(-1, self.flattened_size)

        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [50]:
input_channels = 3  # for RGB images
input_size = 32  # Example size, adjust as needed
output_size = 10  # Example output size

dcign = DCIGN(input_channels, input_size, output_size)

sample_input = torch.randn(1, input_channels, input_size, input_size)

output = dcign(sample_input)

print("Output Tensor:", output)
print("Output Shape:", output.shape)

Output Tensor: tensor([[ 0.0689, -0.0648,  0.0030, -0.1889, -0.0729,  0.0434, -0.1052,  0.0375,
         -0.0234, -0.0689]], grad_fn=<AddmmBackward0>)
Output Shape: torch.Size([1, 10])


# General Adversarial Network (GAN)

Objective:

    -Generate images from random noise through adversarial training
    - Improve generative models' performance by using 2 networks against each other (generator & discriminator) 
Unique Features:

    -Adversarial training: Uses 2 NN's (generator/discriminator) that are trained simultaneously through adversarial processes. 
        -Generator learns to produce increasingly realistic data
        -Discriminator learns to better distinguish between real & generated data
    
    -Feedback loop: Generator is updated based on the feedback from the discriminator, guiding it to product more realistic outputs
    
Common uses:
   
    -Image generation
    -Style transfer
    -Anomoly detection
    -Super resolution

## Generator Class

In [51]:
class Generator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Generator, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

## Discriminator Class

In [52]:
class Discriminator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Discriminator, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

## GAN class
Utilizes the generator and discriminator

In [53]:
class GAN(BaseNN):
    def __init__(self, generator, discriminator):
        super(GAN, self).__init__(input_size=None, hidden_size=None, output_size=None)
        self.generator = generator
        self.discriminator = discriminator

    def forward(self, x):
        generated_data = self.generator(x)
        discriminator_output = self.discriminator(generated_data)
        return generated_data, discriminator_output

In [54]:
# Example usage
input_size = 10
hidden_size = 128
output_size = 10

generator = Generator(input_size, hidden_size, output_size)
discriminator = Discriminator(output_size, hidden_size, 1)

gan_model = GAN(generator, discriminator)

sample_noise = torch.randn((1, input_size))

generated_data, discriminator_output = gan_model(sample_noise)

print(gan_model)
print("Generated Data:", generated_data)
print("Discriminator Output:", discriminator_output)

GAN(
  (generator): Generator(
    (fc1): Linear(in_features=10, out_features=128, bias=True)
    (fc2): Linear(in_features=128, out_features=10, bias=True)
  )
  (discriminator): Discriminator(
    (fc1): Linear(in_features=10, out_features=128, bias=True)
    (fc2): Linear(in_features=128, out_features=1, bias=True)
  )
)
Generated Data: tensor([[0.5185, 0.5771, 0.5177, 0.5572, 0.5355, 0.4523, 0.5175, 0.4826, 0.5086,
         0.4658]], grad_fn=<SigmoidBackward0>)
Discriminator Output: tensor([[0.4669]], grad_fn=<SigmoidBackward0>)


# Spiking Neural Network (SNN)


**Objective:**

- Mimic the biological processes of the human brain more closely than traditional artificial neural networks by using neurons that fire in discrete spikes.
- Process information through the timing of these spikes, enabling the network to efficiently handle spatiotemporal data and perform dynamic pattern recognition.

**Unique Features:**

- **Biologically Inspired:** Incorporates models of neurons that generate discrete spikes, a form of communication used in the biological nervous system.
- **Temporal Dynamics:** Capable of capturing and processing temporal information inherent in the input data through the sequence and timing of spikes.
- **Energy Efficiency:** Designed to be inherently more energy-efficient for certain computations, mirroring the energy efficiency seen in the human brain.
- **Learning Through Time:** Utilizes learning mechanisms such as Spike-Timing-Dependent Plasticity (STDP), allowing the network to adapt based on the timing between spikes.

**Common Uses:**

- **Neurobiological Research:** Offers a platform for exploring theories of brain function and the principles underlying neural computation.
- **Sensory Processing:** Applied in processing and interpreting data from sensory inputs, such as visual and auditory systems, in a manner similar to biological systems.
- **Edge Computing:** Ideal for deployment in edge devices due to their low power consumption, where they can perform real-time data analysis.
- **Pattern Recognition:** Utilized in tasks requiring the detection of patterns over time, such as speech recognition or gesture analysis.
- **Robotic Control:** Empowers robots with the ability to process sensory inputs in real-time, leading to more adaptive and responsive behaviors.

In [55]:
class SNN(BaseNN):
    def __init__(self, input_size, hidden_size, output_size):
        super(SNN, self).__init__(input_size, hidden_size, output_size)
        # Define a simple linear layer to simulate neuron connections
        self.linear = nn.Linear(input_size, hidden_size)
        # Spike function could be a Heaviside step function or similar
        self.spike_fn = lambda x: torch.heaviside(x - 0.5, torch.tensor([0.0]))

    def forward(self, x):
        x = self.linear(x)
        x = self.spike_fn(x)  # Simulate spiking behavior
        return x

In [56]:
# Example usage
input_size = 10
hidden_size = 20
output_size = 10  # Output size is not used in this simplified example but included for consistency with the class definition

# Instantiate the SNN model
snn_model = SNN(input_size, hidden_size, output_size)

# Generate a sample input (batch size, input size)
sample_input = torch.randn((1, input_size))

# Forward pass through the SNN
spiked_output = snn_model(sample_input)

print("Sample Input:", sample_input)
print("Spiked Output:", spiked_output)

Sample Input: tensor([[-1.0168,  0.8071,  1.0896, -0.0985, -0.7639,  0.7797, -1.1863,  1.1127,
          0.0859,  0.4825]])
Spiked Output: tensor([[1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
         0., 0.]], grad_fn=<NotImplemented>)


Output explanation: The output tensor is of size *hidden size* with values either 0 or 1, representing whether each neuron in the hidden layer fired (1) or did not fire(0) based on the simplified spike function.

This example demonstrates the instantation and baic usage, however it is a highly abstracted version for practical implementation and scope limitations.

# Liquid State Machine (LSM)

Objective:
    
    -Process time-varying inputs
    -Utilize high-dimensional transient states (liquid states) induced by input stimuli for copmutation allowing the network to perform temporal pattern regonition and time-series prediction

Unique Features:
    
    -Utilizes a network of spiking neurons to create a responsive state to input stimuli
    -Specializes in handling sequences and temporal patterns
    -Flexible readout layer interprets the reservoir's state for varied tasks

Common Uses:
    
    -Neurobiological simulations and understanding brain functions
    Speech and gesture recognition for interactive systems
    -Time-series forcasting in finance & weather prediction
    -Robotic sensory processing for adaptive control
    -Biometric authentication through pattern analysis

In [57]:
class LSM(SNN):
    def __init__(self, input_size, reservoir_size, output_size):
        super(LSM, self).__init__(input_size, reservoir_size, output_size)
        # In a true LSM, the reservoir would be more complex and involve dynamic connections.
        # Here, we simulate it with a single RNN layer for simplicity.
        self.reservoir = nn.RNN(input_size, reservoir_size, batch_first=True)
        # The readout layer
        self.readout = nn.Linear(reservoir_size, output_size)

    def forward(self, x):
        # Process input through the simplified 'reservoir'
        reservoir_state, _ = self.reservoir(x)
        
        # Assuming the last state as the representation
        reservoir_state = reservoir_state[:, -1, :]
        output = self.readout(reservoir_state)
        return output

In [58]:
# Example usage
input_size = 10
reservoir_size = 128
output_size = 1

# Instantiate the LSM model
lsm_model = LSM(input_size, reservoir_size, output_size)

# Generate a sample input (batch size, sequence length, input size)
# Let's create a batch of 5 sequences, each of length 7 (time steps) with 10 features
sample_input = torch.randn((5, 7, input_size))

# Forward pass through the LSM
output = lsm_model(sample_input)

print("LSM Model:", lsm_model)
print("Output Shape:", output.shape)
print("Output:", output)

LSM Model: LSM(
  (linear): Linear(in_features=10, out_features=128, bias=True)
  (reservoir): RNN(10, 128, batch_first=True)
  (readout): Linear(in_features=128, out_features=1, bias=True)
)
Output Shape: torch.Size([5, 1])
Output: tensor([[ 0.0650],
        [-0.2211],
        [-0.1590],
        [-0.2507],
        [ 0.0578]], grad_fn=<AddmmBackward0>)


### Expected Output and Understanding

- **LSM Model:** This print statement will display the structure of the LSM model, including the RNN (reservoir) and the readout linear layer.
- **Output Shape:** Since the readout layer's output size is 1, and you're processing a batch of 5 sequences, the output shape should be `[5, 1]`, indicating that for each sequence in the batch, you get a single output value.
- **Output:** This will show the actual output values from the LSM. These values are generated by processing the synthetic sequential data through the LSM's reservoir and readout layer.

This example is a straightforward demonstration meant to illustrate how you might set up and use an LSM model with PyTorch for sequence processing tasks. The synthetic data doesn't represent a specific real-world problem, but in practice, you could adapt this setup to work on tasks like time-series forecasting, sequence classification, or any problem where understanding temporal dynamics is crucial.

# Extreme Learning Machine

**High-Level Overview**

Extreme Learning Machines (ELMs) represent an innovative class of single-hidden layer feedforward neural networks (SLFNs) that streamline the learning process by randomly assigning input weights and biases, focusing instead on analytically determining the output weights. This unique approach reduces training complexity and time, making ELMs particularly suitable for rapid prototyping and handling large or noisy datasets efficiently.

**Data Type**

ELMs are versatile, capable of processing:
- Numerical
- Time-series
- Images
- Continuous data

This adaptability makes them applicable across a broad spectrum of data-intensive fields.

**Task Objective**

ELMs excel in:
- Classification
- Regression
- Feature Learning

Their fast learning speed and high efficiency position them as a powerful tool for both predictive modeling and data representation tasks.

**Scalability**

ELMs demonstrate remarkable scalability, efficiently managing large datasets and complex models with adjustable hidden nodes. This attribute is pivotal for applications in big data, where the volume and dimensionality of data can significantly impact computational performance.

**Robustness to Noise**

One of the standout features of ELMs is their robustness to noise, making them exceptionally reliable in real-world scenarios where data quality may vary. This robustness ensures that ELMs maintain high performance even when data is imperfect or incomplete.

**Implementation Variants**

Several variants of ELMs have been developed to cater to specific needs, including:
- **Kernel ELM (KELM):** Offers enhanced capabilities for non-linear problem solving.
- **Online Sequential ELM (OSELM):** Ideal for dynamic environments where data is available in sequences or streams.

**Practical Application Guidance**

**When to Use ELMs:**
- Rapid model development is required.
- Dealing with large or noisy datasets.
- The task involves linear or non-linear problems where quick training is beneficial.

**Considerations:**
- While ELMs offer significant advantages in terms of speed and simplicity, they may not be the best fit for tasks requiring deep interpretability of model decisions. 
- For highly unstructured data, such as raw text or images that necessitate deep learning techniques, exploring other neural network models might yield better results.

### Conclusion

Extreme Learning Machines offer a unique combination of speed, efficiency, and versatility, making them a valuable addition to the neural network toolkit. By understanding their capabilities, implementation variants, and practical applications, researchers and practitioners can effectively leverage ELMs to address a wide range of challenges in data analysis and predictive modeling.

In [59]:
class ELM(BaseNN):
    def __init__(self, input_size, hidden_size, output_size):
        super(ELM, self).__init__(input_size, hidden_size, output_size)
        
        # Initialize random weights for the hidden layer
        self.hidden_weights = torch.randn(input_size, hidden_size)
        self.hidden_bias = torch.randn(hidden_size)
        
        # No learning required for hidden layer, so no need for parameters
        for param in self.parameters():
            param.requires_grad = False
        
        # Linear output layer
        self.output_layer = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        # Calculate the output of the hidden layer
        hidden_output = torch.matmul(x, self.hidden_weights) + self.hidden_bias
        hidden_output = F.relu(hidden_output)  # Apply ReLU activation function
        
        # Pass the hidden layer output through the output layer
        output = self.output_layer(hidden_output)
        return output

In [60]:
# Example usage
input_size = 2
hidden_size = 3
output_size = 1

elm_model = ELM(input_size, hidden_size, output_size)
sample_input = torch.randn((1, input_size))

output_prediction = elm_model(sample_input)
print("Sample Input:", sample_input)
print("Output Prediction:", output_prediction.detach().numpy())

Sample Input: tensor([[-1.4513, -2.6494]])
Output Prediction: [[0.90435386]]


# Echo State Network (ESN)

**High-Level Overview**

Echo State Networks (ESNs) belong to the reservoir computing family, distinguished for their novel approach to training recurrent neural networks (RNNs). ESNs simplify the training process by keeping the internal connections of the network (the "reservoir") fixed while only adjusting the weights of the output layer. This architecture enables ESNs to efficiently process temporal or sequential data, making them particularly adept at tasks requiring memory of past inputs, such as time-series forecasting and sequence modeling.

**Data Type**

Echo State Networks are tailored for:
- Time-series data
- Sequential data
- Temporal patterns in audio and speech

Their specialization in handling sequences makes them versatile tools for dynamic systems analysis and prediction.

**Task Objective**

ESNs excel in / are suited for:
- Time-series forecasting from financial markets, environmental sensors, or health monitoring.
- Sequential data in natural language processing or music generation.
- Dynamic system modeling
- Temporal patterns in audio and speech recognition, benefiting from ESNs' capacity to handle varying lengths of sequences.

The ability of ESNs to retain information over time and their sensitivity to initial conditions enable complex temporal pattern recognition and prediction capabilities.

**Scalability**

Thanks to their fixed reservoir, ESNs efficiently manage large datasets and sequences without the computational burden typical of fully trainable RNNs. This attribute allows ESNs to excel in tasks with long dependencies, providing scalability and performance advantages.

**Robustness to Noise**

Echo State Networks demonstrate considerable robustness to noise in input data, thanks to the reservoir's capacity to absorb and process noisy signals without significant degradation in performance. This characteristic is invaluable in real-world applications where data often comes with a degree of uncertainty and variability.

**Implementation Variants**

Variations of Echo State Networks have been developed to optimize performance across different domains, including:
- **Leaky Integrator ESNs:** Introduce a mechanism to control the speed at which the reservoir's state updates, improving memory and stability.
- **Deep ESNs:** Stack multiple reservoirs to enhance representational capacity and capture more complex patterns in data.

**Practical Application Guidance**

**When to Use Echo State Networks:**
- In scenarios where capturing long-term dependencies in sequential data is crucial.
- For tasks that benefit from quick, efficient training of recurrent structures.
- When working with noisy time-series or dynamic systems requiring predictive modeling.

**Considerations:**
- While ESNs offer a compelling approach for specific sequential tasks, their performance is heavily dependent on the reservoir's configuration and the quality of the output layer's training.
- Ensuring the reservoir possesses the "echo state property" is critical for model stability and effectiveness, requiring careful tuning of parameters such as connectivity and spectral radius.

### Conclusion

Echo State Networks provide a powerful yet efficient solution for modeling and predicting temporal sequences, standing out for their simplicity and performance in sequence-related tasks. Their design philosophy, emphasizing the "echo state property" and focusing training on the output layer, offers a unique approach to recurrent neural network training. By leveraging ESNs, practitioners can tackle a wide array of challenges in temporal data analysis, benefiting from their robustness to noise and their capacity to capture complex dynamic behaviors.

In [61]:
class ESN(nn.Module):
    def __init__(self, input_size, reservoir_size, output_size):
        super(ESN, self).__init__()
        # Initialize weights
        self.W_in = nn.Parameter(torch.randn(input_size, reservoir_size) * 0.1)  # Scaled for stability
        self.W_res = nn.Parameter(torch.randn(reservoir_size, reservoir_size) * 0.1)
        self.W_out = nn.Parameter(torch.randn(reservoir_size, output_size) * 0.1)
        
        # Ensure the echo state property through spectral radius adjustment
        spectral_radius = torch.max(abs(torch.linalg.eigvals(self.W_res)))
        self.W_res.data /= spectral_radius / 0.95  # Adjust to slightly below 1 for stability
        
        # Initialize the reservoir state
        self.reservoir_state = torch.zeros(1, reservoir_size)

    def forward(self, x):
        # Update the reservoir with current input and previous state
        self.reservoir_state = torch.tanh(x @ self.W_in + self.reservoir_state @ self.W_res)
        
        # Compute the output
        output = self.reservoir_state @ self.W_out
        return output

In [62]:
# Example usage
input_size = 2
reservoir_size = 5
output_size = 1

esn_model = ESN(input_size, reservoir_size, output_size)
sample_input = torch.randn((1, input_size))

output_prediction = esn_model(sample_input)
print("Sample Input:", sample_input)
print("Output Prediction:", output_prediction.detach().numpy())

Sample Input: tensor([[-0.4273, -0.6233]])
Output Prediction: [[-0.01071815]]


# Deep Residual Network (DRN) (ResNet)

### Deep Residual Networks: A Detailed Exploration

**High-Level Overview**

Deep Residual Networks (ResNets) revolutionized deep learning architectures through the introduction of "residual blocks," allowing models to learn identity functions and effectively address the vanishing gradient problem. This innovation enables the training of networks that are much deeper than previously possible, significantly improving performance on tasks requiring the recognition of complex patterns, such as image classification and object detection.

**Data Type**

ResNets are particularly effective with:
- Images
- Video data

Their design makes them a powerhouse in computer vision applications, from basic image recognition to complex video analysis tasks.

**Task Objective**

Deep Residual Networks excel in:
- Image classification
- Object detection
- Semantic segmentation
- Face recognition

By enabling the training of deeper networks without degradation, ResNets push the boundaries of what's achievable in visual recognition tasks.

**Scalability**

The architecture of ResNets, characterized by residual blocks that skip one or more layers, directly addresses the scalability issue in deep learning. This allows ResNets to scale to hundreds or even thousands of layers while maintaining or even improving performance, a feat unattainable by traditional deep neural networks.

**Robustness to Noise**

ResNets show remarkable robustness to noise in the input data, thanks to their deep and complex architectures capable of extracting essential features while discarding irrelevant information. This robustness is critical for real-world applications where data imperfections are common.

**Implementation Variants**

Several variants of Deep Residual Networks have been developed to optimize performance across different domains and tasks, including:
- **ResNet-50, ResNet-101, and ResNet-152:** Variations with different depths, balancing computational efficiency and performance.
- **ResNeXt:** Introduces a "cardinality" dimension as an alternative to going deeper or wider, further improving model performance.

**Practical Application Guidance**

**When to Use Deep Residual Networks:**
- In tasks that benefit from deep architectures, such as complex image and video analysis.
- When looking for models that provide state-of-the-art results in computer vision.

**Considerations:**
- While ResNets significantly mitigate the vanishing gradient problem, they also increase computational complexity. Efficient hardware or optimization techniques may be necessary for training.
- The choice among ResNet variants depends on the specific requirements of the task, including computational resources and performance targets.

### Conclusion

Deep Residual Networks represent a major milestone in the evolution of neural network architectures, offering unparalleled depth and performance in a wide range of computer vision tasks. Their innovative design not only tackles longstanding challenges in training deep networks but also sets a new standard for what's achievable in machine learning and artificial intelligence.

In [63]:
# Residual Block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

In [64]:
class ResNet(BaseNN):
    def __init__(self, input_size, hidden_size, output_size, num_blocks):
        super(ResNet, self).__init__(input_size, hidden_size, output_size)
        self.in_channels = 64
        self.conv = nn.Conv2d(3, self.in_channels, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(self.in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(self.in_channels, hidden_size, num_blocks, stride=1)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))

        # The number of output channels from the last block will be `hidden_size`
        # Adjust this based on architecture
        final_out_channels = hidden_size

        # Initialize the fully connected layer with the correct number of input features
        self.fc = nn.Linear(final_out_channels, output_size)

    def _make_layer(self, in_channels, out_channels, num_blocks, stride):
        layers = []
        downsample = None
        if stride != 1 or in_channels != out_channels:
            downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        layers.append(ResidualBlock(in_channels, out_channels, stride, downsample))
        for _ in range(1, num_blocks):
            layers.append(ResidualBlock(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

In [65]:
# Example Usage
# Instantiate the ResNet model
resnet_model = ResNet(input_size=3, hidden_size=64, output_size=10, num_blocks=2)

# Define a sample input tensor
sample_input = torch.rand((4, 3, 224, 224))  # Batch size, Channels, Height, Width

# Forward pass to get the output
output_resnet = resnet_model(sample_input)

# Print the model architecture and output
print(resnet_model)
print("Output shape:", output_resnet.shape)

ResNet(
  (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): ResidualBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): ResidualBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True

# Kohonen Networks (KN) / Self-Organizing Feature Map (SOFM)

**High-Level Overview**

Kohonen Networks, also known as Self-Organizing Maps (SOMs), are a type of artificial neural network that uses unsupervised learning to produce a two-dimensional, discretized representation of the input space of the training samples. Developed by Teuvo Kohonen in the 1980s, these networks are particularly effective at preserving the topological properties of the input space, making them powerful tools for visualizing high-dimensional data in lower-dimensional spaces.

**Data Type**

SOMs can process a wide range of data types, including:
- Numerical data
- High-dimensional datasets
- Multivariate data

Their ability to handle high-dimensional data makes them suitable for complex pattern recognition tasks across various domains.

**Task Objective**

Kohonen Networks are utilized for:
- Dimensionality reduction
- Clustering and visualization of high-dimensional data
- Feature mapping and extraction
- Anomaly detection in datasets

By organizing data into clusters that preserve the topological properties, SOMs facilitate a deeper understanding of data structure and relationships.

**Scalability**

SOMs are relatively scalable to large datasets, with their computational complexity mainly depending on the size of the map and the dimensionality of the input data. However, the training time can increase significantly with the size of the input space and the resolution of the map.

**Robustness to Noise**

SOMs demonstrate a good level of robustness to noise due to their competitive learning process, where only the neuron most similar to the input (and its neighbors) is updated. This local update mechanism allows SOMs to smooth out noise and focus on the underlying patterns in the data.

**Implementation Variants**

There are several variants of Kohonen Networks, including:
- **Batch SOMs:** Update the map using all training data at once, rather than one sample at a time, often leading to faster convergence.
- **Toroidal SOMs:** Implement boundary conditions that wrap around the map, useful for cyclic or continuous data.
- **Growing SOMs:** Dynamically adjust the size of the map during training to better fit the data.

**Practical Application Guidance**

**When to Use Kohonen Networks:**
- For exploratory data analysis to visualize and understand the structure of high-dimensional data.
- In clustering tasks where the preservation of input space topology is desired.
- For feature extraction and reduction before applying other machine learning models.

**Considerations:**
- The choice of map size and learning parameters can significantly affect the quality of the resulting SOM and requires careful tuning.
- While SOMs provide valuable insights into data structure, they may not directly optimize for specific predictive tasks and are often used in a complementary role with other models.

### Conclusion

Kohonen Networks or Self-Organizing Maps offer a unique approach to the unsupervised learning of high-dimensional data, providing a structured way to visualize and cluster complex datasets. By preserving the topological and metric relationships of the input space, SOMs serve as a valuable tool in the data scientist's toolkit, especially for tasks involving data exploration and understanding.

In [66]:
class SOFM(nn.Module):
    def __init__(self, input_dim, map_size, lr=0.1, sigma=None):
        super(SOFM, self).__init__()
        self.map_size = map_size
        self.lr = lr
        self.sigma = sigma if sigma is not None else max(map_size) / 2  # Initial radius
        self.weight = nn.Parameter(torch.randn(map_size[0], map_size[1], input_dim))
        self.locations = torch.tensor(np.array([[i, j] for i in range(map_size[0]) for j in range(map_size[1])])).float()

    def forward(self, x):
        # Ensure x is compatible with weight dimensions for broadcasting
        x = x.view(1, 1, -1)  # Reshape x to [1, 1, input_dim]

        # Calculate square difference
        sq_diff = torch.sum((self.weight - x) ** 2, dim=2)

        # Find the best matching unit (BMU)
        _, bmu_idx = torch.min(sq_diff.view(-1), dim=0)  # Flatten sq_diff and find index of BMU

        # Retrieve the BMU location
        bmu_location = self.locations[bmu_idx]  # Correctly index into self.locations

        # Calculate distance squared from all neurons to BMU
        distance_sq = torch.sum((self.locations - bmu_location) ** 2, dim=1)
        lr = self.lr * torch.exp(-distance_sq / (2 * self.sigma ** 2))  # Adjust learning rate based on distance

        # Apply learning rate to weight update
        # Ensure lr is correctly shaped for broadcasting
        lr = lr.view(self.map_size[0], self.map_size[1], 1)
        weight_update = lr * (x - self.weight)

        # Update the weights
        self.weight.data += weight_update

        return bmu_idx



    def train_sofm(self, data, epochs):
        for epoch in range(epochs):
            for x in data:
                self.forward(x)
            # Decay learning parameters
            self.lr *= 0.995  # Learning rate decay
            self.sigma *= 0.995  # Radius decay

In [67]:
# Example usage
input_dim = 3  # Dimensionality of input data
map_size = (10, 10)  # Size of the SOFM map

sofm = SOFM(input_dim=input_dim, map_size=map_size)
data = torch.rand(100, input_dim)  # Example dataset

sofm.train_sofm(data, epochs=100)

# Support Vector Machine (SVM)

### Support Vector Machines: A Comprehensive Analysis

**High-Level Overview**

Support Vector Machines (SVMs) are a powerful class of supervised learning models used for classification and regression tasks. Developed in the 1960s and refined in the 1990s, SVMs are based on the concept of finding the hyperplane that best separates different classes in the feature space. By maximizing the margin between the nearest points of the classes (support vectors), SVMs achieve high generalization ability, making them highly effective for a wide range of pattern recognition tasks.

**Data Type**

SVMs are versatile and can process:
- Numerical data
- Categorical data (after encoding)
- Text data (using TF-IDF or word embeddings)
- Image data (using feature extraction techniques)

This flexibility allows SVMs to be applied in various domains, from document classification to image recognition.

**Task Objective**

SVMs are primarily used for:
- Binary classification
- Multiclass classification
- Regression tasks
- Outlier detection

Their effectiveness in high-dimensional spaces and with complex decision boundaries makes them suitable for tasks requiring precise and robust classification and regression models.

**Scalability**

While SVMs perform exceptionally well on small to medium-sized datasets, their computational complexity can become a challenge with very large datasets or extremely high-dimensional spaces. Kernel tricks and dimensionality reduction techniques are often used to mitigate these challenges and enhance scalability.

**Robustness to Noise**

SVMs exhibit a significant level of robustness to noise and overfitting, especially in scenarios where the margin is maximized with a correct choice of the regularization parameter. Their reliance on support vectors (the most informative data points) rather than the entire dataset contributes to their resilience.

**Implementation Variants**

Several variants of SVMs exist to cater to specific needs, including:
- **Linear SVMs:** Best suited for linearly separable data.
- **Kernel SVMs:** Use kernel functions to operate in a transformed feature space, allowing them to handle non-linear data.
- **Nu-SVMs and C-SVMs:** Offer different formulations for the optimization problem, providing flexibility in controlling the trade-off between margin size and misclassification error.

**Practical Application Guidance**

**When to Use SVMs:**
- In binary or multiclass classification problems with clear margin separation.
- For text and image classification tasks where high-dimensional feature spaces are common.
- In applications where model interpretability is important, as the support vectors and the decision boundary provide insights into the model's predictions.

**Considerations:**
- Choosing the right kernel and tuning hyperparameters (like C and gamma) are crucial steps that significantly impact SVM performance.
- SVMs may require more preprocessing effort, such as normalization and encoding, to ensure optimal model training.

### Conclusion

Support Vector Machines stand out for their robustness, versatility, and efficacy in handling classification and regression tasks across a broad spectrum of domains. By effectively navigating the trade-offs between complexity and performance, practitioners can leverage SVMs to build highly accurate, generalizable models for a diverse array of challenges in machine learning and pattern recognition.

In [68]:
class LinearSVM(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearSVM, self).__init__()
        self.fc = nn.Linear(input_size, output_size)  # Output size is 1 for binary classification

    def forward(self, x):
        return self.fc(x)

def hinge_loss(y_pred, y_true):
    return torch.mean(torch.clamp(1 - y_pred * y_true, min=0))

Hinge Loss: A custom function that computes the hinge loss, encouraging the model to not only correctly classify training samples but also to maximize the margin between classes.

In [69]:
# Example usage
input_size = 2  # Number of features
output_size = 1  # Binary classification

# Generate synthetic data for binary classification
# Class 0: centered at (0.5, 0.5), Class 1: centered at (1.5, 1.5)
n_samples = 100
x_class0 = torch.rand((n_samples//2, input_size)) * 0.5
y_class0 = torch.zeros(n_samples//2, 1)
x_class1 = torch.rand((n_samples//2, input_size)) * 0.5 + 1
y_class1 = torch.ones(n_samples//2, 1)

x_train = torch.cat([x_class0, x_class1], dim=0)
y_train = torch.cat([y_class0, y_class1], dim=0)

# Initialize SVM model
svm_model = LinearSVM(input_size, output_size)

# Optimizer
optimizer = optim.SGD(svm_model.parameters(), lr=0.02)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    # Forward pass
    outputs = svm_model(x_train).squeeze()  # Ensure output matches dimension of y_train
    labels = 2 * y_train.squeeze() - 1  # Convert labels to {-1, 1}
    loss = hinge_loss(outputs, labels)

    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Test the model
with torch.no_grad():
    y_pred = svm_model(x_train).squeeze()
    y_pred_labels = (y_pred > 0).float()
    accuracy = (y_pred_labels == y_train.squeeze()).float().mean()
    print(f'Accuracy: {accuracy:.4f}')

Epoch [10/100], Loss: 0.8118
Epoch [20/100], Loss: 0.7104
Epoch [30/100], Loss: 0.6596
Epoch [40/100], Loss: 0.6342
Epoch [50/100], Loss: 0.6113
Epoch [60/100], Loss: 0.5884
Epoch [70/100], Loss: 0.5655
Epoch [80/100], Loss: 0.5426
Epoch [90/100], Loss: 0.5198
Epoch [100/100], Loss: 0.4969
Accuracy: 0.7900


#### Simplifications and Disclaimers
Linear Decision Boundary: This example focuses on a linear SVM, suitable for linearly separable data. Real-world datasets often require more complex models, such as kernel SVMs, to capture non-linear relationships.
Gradient Descent Optimization: While traditional SVMs are typically trained using quadratic programming solvers to directly solve the convex optimization problem, this example employs gradient descent for simplicity and educational clarity.
Feature Space and Data Type: The demonstration uses synthetic numerical data. In practice, SVMs can be applied to a wide range of data types, including categorical and text data, often requiring preprocessing steps like encoding and feature extraction.
Performance Metrics: The example primarily evaluates the model based on accuracy. Comprehensive model evaluation might include additional metrics and validation techniques to assess performance thoroughly.

# Neural Turing Machine

**High-Level Overview**

Neural Turing Machines (NTMs) introduced an innovative blend of neural networks with an external memory component, reminiscent of traditional Turing machines. This combination allows NTMs to perform data processing while also having the capability to store and retrieve information dynamically. Initially proposed to enhance problem-solving and data manipulation, NTMs marked a significant evolution in the quest to imbue neural networks with memory and learning flexibility similar to human cognition.

**Data Type**

Neural Turing Machines are adept at handling diverse data types, including:
- Sequential data
- Numerical data
- Symbolic data

This versatility stems from their unique memory system, enabling them to manage tasks that necessitate a nuanced understanding of sequences and historical context.

**Task Objective**

NTMs have shown promise in various applications, such as:
- Sequence prediction and generation
- Executing simple algorithms (e.g., sorting, copying)
- Complex problem-solving requiring memory and deduction

Despite their potential, the practical adoption of NTMs in solving broad AI challenges remains exploratory, with newer models and architectures often preferred for specific applications.

**Scalability**

The scalability of Neural Turing Machines is influenced by the neural network's architecture and the external memory's design. While theoretically capable of handling tasks with extensive memory requirements, practical implementations face challenges related to training complexity and computational demands.

**Robustness to Noise**

NTMs offer a degree of robustness to noise, leveraging the neural network's capacity for pattern recognition alongside structured memory access to mitigate the effects of noisy inputs. However, the intricacies of memory management can introduce complexities in achieving consistent performance across varied tasks.

**Implementation Variants**

Advancements in the concept of NTMs have led to the development of variants such as:
- **Differentiable Neural Computers (DNCs):** DNCs build on NTMs by refining memory access mechanisms, aiming for greater efficiency and applicability.
- **Memory Networks:** While not direct descendants, Memory Networks share the ethos of integrating memory with neural processing, tailored for tasks requiring reasoning and long-term context.

**Practical Application Guidance**

**When to Explore Neural Turing Machines:**
- For academic and research endeavors focused on enhancing neural networks with memory capabilities.
- In experimental projects aiming to understand and innovate on the integration of external memory systems with machine learning models.

**Considerations:**
- The complexity of NTMs and their variants necessitates a deep understanding of both neural networks and memory systems, making them more suited for research than immediate practical application.
- Given the rapid advancement in AI, practitioners often turn to more recent developments tailored to specific use cases, such as Transformer models for sequence understanding and GANs for generative tasks.

### Conclusion

While Neural Turing Machines represent a pivotal step towards more intelligent and flexible AI systems, their role in current neural network research has evolved. Today, NTMs are appreciated more for their conceptual contributions to integrating memory with neural processing rather than as a go-to solution for practical applications. As AI continues to advance, the principles underlying NTMs inspire ongoing innovations in creating models that more closely mimic human cognitive abilities.

In [70]:
class NTM(nn.Module):
    def __init__(self, input_size, output_size, controller_size, memory_units, memory_unit_size):
        super(NTM, self).__init__()
        self.controller = nn.LSTM(input_size, controller_size, batch_first=True)  # Note: batch_first=True
        self.memory = torch.zeros(memory_units, memory_unit_size)  # Initialize memory
        
        # Parameters for read and write heads (simplified)
        self.read_weights = nn.Parameter(torch.randn(memory_units))
        self.write_weights = nn.Parameter(torch.randn(memory_units))
        self.read_vector = nn.Parameter(torch.randn(memory_unit_size))
        self.output_layer = nn.Linear(controller_size + memory_unit_size, output_size)
        
    def read_from_memory(self):
        # Simplified read mechanism: weighted sum of memory content
        return torch.matmul(self.read_weights, self.memory)
    
    def write_to_memory(self, write_vector):
        # Simplified write mechanism: update memory with new content
        self.memory += torch.outer(self.write_weights, write_vector)
        
    def forward(self, x):
        # Controller processing. Note: Assuming x is of shape [batch, seq_len, input_size]
        controller_output, _ = self.controller(x)
        
        # For simplicity, assuming we're dealing with single time-step sequences or interested in the last time step
        # Adjusting controller_output to ensure it's [batch, controller_size]
        controller_output = controller_output[:, -1, :]
        
        # Read from memory
        read_vector = self.read_from_memory()  # This will be [memory_units, memory_unit_size]
        # Ensuring read_vector is [batch, memory_unit_size] for batch processing
        read_vector = read_vector.unsqueeze(0).expand(controller_output.size(0), -1)  # Expand to match batch size
        
        # Prepare controller output and read vector for final output
        combined_output = torch.cat([controller_output, read_vector], dim=1)
        final_output = self.output_layer(combined_output)
        
        # Example write operation (could be based on controller output or other logic)
        write_vector = torch.randn_like(read_vector[0])  # Assuming write_vector needs to match a single read_vector
        self.write_to_memory(write_vector)
        
        return final_output

In [71]:
# Initialize NTM parameters
input_dim = 4
output_dim = 2
controller_size = 10
memory_units = 20
memory_unit_size = 5

# Instantiate the NTM model
ntm_model = NTM(input_dim, output_dim, controller_size, memory_units, memory_unit_size)

# Generate a sample input sequence
sample_input = torch.randn(10, 1, input_dim)  # Example: 10 sequences, each with 1 timestep and input_dim features

# Forward pass through the NTM
output_sequences = [ntm_model(seq.unsqueeze(0)) for seq in sample_input]  # Note the addition of unsqueeze(0) for batch dimension

# Convert output sequences to a tensor
output_tensor = torch.cat(output_sequences, dim=0)  # Concatenating along the batch dimension

In [72]:
print("Output Tensor Shape:", output_tensor.shape)

Output Tensor Shape: torch.Size([10, 2])


In [73]:
print(sample_input)

tensor([[[ 1.0000, -0.9340, -0.0189,  0.1921]],

        [[ 0.5453, -0.6682,  0.6663, -0.8751]],

        [[ 0.7785, -0.0538,  0.0947, -0.9699]],

        [[ 0.2586, -1.4413,  0.9093, -0.1881]],

        [[ 0.2230, -0.1907, -0.4377,  0.2058]],

        [[-0.0976,  0.1953,  1.5992,  1.0664]],

        [[ 0.3100, -2.2921, -0.7467, -0.9357]],

        [[-0.5578,  1.2146, -0.5251,  0.5620]],

        [[-0.1448, -0.1506, -1.3177,  0.1048]],

        [[ 0.1158, -1.2906, -1.9126,  0.3576]]])


In [74]:
print(output_tensor)

tensor([[ 0.1866, -0.2697],
        [ 0.1113, -0.3211],
        [ 0.1246, -0.2864],
        [ 0.1439, -0.2877],
        [ 0.1093, -0.1458],
        [ 0.1600, -0.3097],
        [ 0.1537, -0.3854],
        [ 0.3131, -0.5121],
        [ 0.3009, -0.4569],
        [ 0.3299, -0.4673]], grad_fn=<CatBackward0>)
