# Dense Neural Network (DNN)
##### Author: Chenyang Skylar Li

# Table of Contents
1. [Introduction](#introduction)
2. [Mathematical Foundations](#mathematical-foundations)
3. [Learning Algorithm](#learning-algorithm)
4. [Pros and Cons](#pros-and-cons)
5. [Suitable Tasks and Datasets](#suitable-tasks-and-datasets)
6. [References](#references)


## Introduction

Dense Neural Networks, also known as fully connected networks, are the simplest kind of artificial neural network. A neuron in a dense layer has a connection to every neuron in the previous layer. The concept of artificial neural networks dates back to the 1940s with the McCulloch-Pitts neuron model, followed by the perceptron in the late 1950s. Over the years, the technology has evolved with the advent of multi-layer perceptrons and deep learning.

## Mathematical Foundations

A Dense Neural Network (DNN) is a type of artificial neural network where each neuron in a layer is connected to all neurons in the previous and next layers. [Perceptron](perceptron.ipynb), [Linear Regression](./linear_regression.ipynb), and [Logistic Regression](./logistic_regression.ipynb) can each be thought of as a special case of a neural "network": a single neuron with an appropriate activation function (or none, in the case of Linear Regression). The power of a DNN comes from adding more layers of neurons and more neurons per layer, which allows the network to learn and represent more complex functions.


In a dense neural network, each neuron takes the weighted sum of all its inputs, adds a bias term, and then applies an activation function.

Mathematically, for a given layer $l$, the output $h^{(l)}$ of that layer is given by:

$$h^{(l)} = \sigma(W^{(l)}h^{(l-1)} + b^{(l)})$$

Where,
- $W^{(l)}$ and $b^{(l)}$ are the weights and biases for layer $l$
- $h^{(l-1)}$ is the output from the previous layer
- $\sigma$ is the activation function. Common choices include:
    - Linear: $\sigma(z) = z$, which is used in Linear Regression
    - Step: $\sigma(z) = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}$, which is used in Perceptron
    - Sign: $\sigma(z) = \begin{cases} 1 & z > 0 \\ -1 & z \leq 0 \end{cases}$, which is used in Perceptron
    - Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, which is used in Logistic Regression
    - Tanh: $\sigma(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$, a zero-centered alternative to the sigmoid that is often used in hidden layers
    - Rectified Linear Unit (ReLU): $\sigma(z) = \max(0, z)$, which is used in DNNs
    - Leaky ReLU: $\sigma(z) = \max(0.01z, z)$, which is used in DNNs
    - Softmax: $\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$, which is used in DNNs for multi-class classification
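The layer equation and a few of the activation functions above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the helper `dense_forward` and the example weights, biases, and inputs are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # Leaky ReLU: max(slope * z, z), applied element-wise
    return np.maximum(slope * z, z)

def sigmoid(z):
    # Sigmoid: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtracting the maximum does not change the result
    # but avoids overflow in the exponential.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def dense_forward(W, b, h_prev, activation):
    # h^(l) = sigma(W^(l) h^(l-1) + b^(l))
    return activation(W @ h_prev + b)

# Hypothetical layer with 3 inputs and 2 neurons
W = np.array([[0.5, -1.0, 0.25],
              [1.5, 0.5, -0.5]])
b = np.array([0.1, -0.2])
h_prev = np.array([1.0, 2.0, 4.0])

h = dense_forward(W, b, h_prev, relu)
print(h)
```

Here the first neuron's weighted sum plus bias is negative, so ReLU clips it to zero, while the second neuron passes its positive value through unchanged.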

## Learning Algorithm

DNNs typically learn with a variant of [gradient descent](../optimization_algorithm/gradient_descent.ipynb) combined with backpropagation. Backpropagation computes the gradients of the loss, and gradient descent uses them to iteratively adjust the network's weights and biases to minimize a loss function.

Training a Dense Neural Network typically involves the following steps:

1. **Initialization**: Initialize the weights with small random numbers (biases are commonly set to zero). Random weights break the symmetry and ensure different neurons learn different things.

2. **Forward Propagation**: For each training instance in the dataset, perform a forward pass through the network. This involves computing the weighted sums and activations of all neurons.

   For a neuron in layer $l$ with input $x$ from the previous layer, weights $w$, and bias $b$, the output $a$ (activation) is calculated as:

    $$z = w \cdot x + b$$
    $$a = \sigma(z)$$

   where $\sigma$ is the activation function (like ReLU, sigmoid, tanh, etc.), and $z$ is the weighted sum.

3. **Loss Calculation**: At the output layer, calculate the loss (or cost) for the prediction. The loss function used depends on the task at hand (e.g., mean squared error for regression tasks, cross-entropy for classification tasks, etc.).

4. **Backward Propagation (Backpropagation)**: Compute the gradient of the loss function with respect to each weight and bias in the network. This involves applying the chain rule to find these derivatives (gradients) from the output layer back to the input layer.

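   The error term at the output layer $L$ is $\delta^{(L)} = \nabla_{a^{(L)}} C \odot \sigma'(z^{(L)})$, where $\odot$ denotes the element-wise product. The error term is then propagated backward, and the gradients for layer $l$ are calculated as:

    $$\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})$$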
    $$\frac{\partial C}{\partial b^{(l)}_j} = \delta^{(l)}_j$$
    $$\frac{\partial C}{\partial w^{(l)}_{jk}} = a^{(l-1)}_k \delta^{(l)}_j$$
    
    Where:
    - $j$ is the index of a neuron in layer $l$.
    - $k$ is the index of a neuron in the previous layer ($l-1$).
    - $w^{(l)}_{jk}$ is the weight connecting the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer (an entry of the weight matrix $W^{(l)}$).
    - $\frac{\partial C}{\partial w^{(l)}_{jk}}$ represents the rate of change of the cost function $C$ with respect to that weight.
    - $a^{(l-1)}_k$ is the activation of the $k^{th}$ neuron in the $(l-1)^{th}$ layer.
    - $\delta^{(l)}_j$ is the error term associated with the $j^{th}$ neuron in the $l^{th}$ layer.
    
5. **Update Weights and Biases**: Update the weights and biases in the direction that decreases the loss. This is often done using a variant of Stochastic Gradient Descent (SGD). The weights $w$ and biases $b$ are updated as follows:

    $$w = w - \eta \frac{\partial C}{\partial w}$$
    $$b = b - \eta \frac{\partial C}{\partial b}$$

   where $\eta$ is the learning rate, a hyperparameter that controls how much we adjust the weights with respect to the loss gradient.

6. **Iteration**: Repeat steps 2-5 for a set number of iterations or until the network's predictions are satisfactory. This is usually done over multiple epochs, where an epoch is one complete pass through the entire training dataset.
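The six steps above can be put together in a short NumPy sketch. It is only an illustration under several assumptions that are not part of the text: a toy XOR dataset, one hidden layer of 4 sigmoid neurons, a mean squared error loss, full-batch gradient descent, and arbitrary values for the learning rate, the random seed, and the number of epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR data (an assumption for illustration); each column is one instance.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
m = X.shape[1]  # number of training instances

# Step 1: small random weights, zero biases
W1 = rng.normal(0.0, 0.5, size=(4, 2))
b1 = np.zeros((4, 1))
W2 = rng.normal(0.0, 0.5, size=(1, 4))
b2 = np.zeros((1, 1))

eta = 0.5  # learning rate
losses = []

for epoch in range(5000):  # Step 6: repeat steps 2-5
    # Step 2: forward propagation
    z1 = W1 @ X + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)

    # Step 3: mean squared error (with a 1/2 factor), averaged over instances
    losses.append(0.5 * np.mean((a2 - Y) ** 2))

    # Step 4: backpropagation; sigmoid'(z) = a * (1 - a)
    delta2 = (a2 - Y) * a2 * (1.0 - a2)          # output-layer error term
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)   # propagated to the hidden layer
    dW2 = delta2 @ a1.T / m
    db2 = delta2.mean(axis=1, keepdims=True)
    dW1 = delta1 @ X.T / m
    db1 = delta1.mean(axis=1, keepdims=True)

    # Step 5: gradient descent update
    W2 -= eta * dW2
    b2 -= eta * db2
    W1 -= eta * dW1
    b1 -= eta * db1

print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.4f}")
```

Each gradient array has the same shape as the parameter it updates, which is a quick sanity check when writing backpropagation by hand. In practice, frameworks such as Keras compute these gradients automatically.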



## Pros and Cons

**Pros:**
- Can model complex, non-linear relationships
- Can handle high-dimensional data
- Scalable to large datasets

**Cons:**
- Requires a large amount of data
- Prone to overfitting
- Difficult to interpret

## Suitable Tasks and Datasets

DNNs are versatile and can handle a variety of tasks and datasets. Here are some examples:

1. **Binary and Multiclass Classification**: Dense Neural Networks can handle binary and multiclass classification tasks quite effectively. For instance, they can be used for digit recognition, image classification, sentiment analysis, etc.

2. **Regression**: They are also suitable for regression tasks where the goal is to predict a continuous output. For example, predicting house prices, stock prices, etc.

3. **Feature Learning**: Dense Neural Networks can learn to extract useful features from raw data. This makes them suitable for tasks where handcrafted features are difficult to create.

4. **Large Datasets**: Dense Neural Networks tend to perform better on larger datasets because they have the capacity to learn complex representations.

5. **High-Dimensional Data**: Dense Neural Networks are capable of handling high-dimensional data, such as images, audio, and text. They can learn hierarchical representations of the data, which helps in capturing complex patterns.

6. **Text and Time-Series Data**: Although Recurrent Neural Networks (RNNs) or Transformers are often more suitable for sequence data, Dense Neural Networks can still be used with some success, especially when sequence length is fixed and relatively short.
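As a sketch of the fixed-length case in the last item, a time series can be cut into short sliding windows so that each window becomes one fixed-size input vector for a dense network. The helper `make_windows`, the window length, and the toy series below are hypothetical and only illustrate the data preparation.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into fixed-length inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    # Row i holds the values at positions i .. i + window - 1
    X = np.stack([series[i:i + window] for i in range(n)])
    # The target for row i is the value that follows its window
    y = series[window:]
    return X, y

series = np.arange(10.0)  # toy series 0, 1, ..., 9
X, y = make_windows(series, window=3)
print(X.shape, y.shape)
```

Every row of `X` has the same length, so it can be fed to a dense input layer with `window` units.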

It is important to note that the performance of a Dense Neural Network is dependent on the quality and the quantity of the data, the network architecture, and the choice of hyperparameters.

## References
1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). [Deep learning](http://www.deeplearningbook.org/). MIT press.
2. Keras Documentation: [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/)
3. Keras Documentation: [Guide to the Sequential model](https://keras.io/models/sequential/)
4. TensorFlow Documentation: [tf.keras.Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential)
5. Géron, A. (2022). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media.



In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


In [None]:
# Load the Fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Scale the pixel values to be between 0 and 1
X_train = X_train / 255.0
X_test = X_test / 255.0

# Create a dense neural network with 2 hidden layers and a softmax output layer
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model, holding out 10% of the training data for validation
# so that the test set stays unseen until the final evaluation
model.fit(X_train, y_train, epochs=10, validation_split=0.1)

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

