<a href="https://colab.research.google.com/github/Ivyson/Neural-Network-XOR/blob/main/Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Neural Network Basics**  

Traditional programming relies on explicitly defined rules to process information and make decisions. However, many real-world problems are too complex to be solved this way. A neural network provides an alternative approach by learning patterns from data, making it capable of adapting to new inputs without requiring human intervention at every step.  

At its core, a neural network is designed to recognize patterns by processing input data through connected layers of neurons. Each neuron takes in multiple inputs, applies weights to them to indicate their contribution to the outputs, adds a bias to shift the result, and then passes the sum through an **activation function**. This function determines whether the neuron should "fire" or remain inactive, introducing non-linearity into the model to help it learn more complicated patterns.  

The simplest neural network consists of only an **input layer** and an **output layer**, forming a **perceptron**. In this basic setup, input data is fed into the network, weighted, and processed to produce an output. This process is referred to as **Feed Forward Propagation**. However, real-world problems require more depth and complexity, which is where **hidden layers** come in. These layers sit between the input and output layers, allowing the network to **extract deeper features** from the data. With multiple hidden layers, a neural network can learn intricate relationships between inputs and outputs, making it more effective in solving complex problems like image recognition, natural language processing, and predictive analytics.   
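To make the perceptron idea concrete, here is a minimal sketch of a single-layer perceptron with a step activation. The weights and bias are hand-picked for illustration (an assumption, not learned values), and they happen to reproduce logical OR:

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum followed by a step activation: 1 if the neuron fires, else 0."""
    return int(np.dot(w, x) + b > 0)

# Hand-picked (not learned) parameters, chosen only for illustration
w = np.array([1.0, 1.0])
b = -0.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x), w, b) for x in inputs]
print(outputs)
```

In practice these parameters would be learned from data rather than chosen by hand.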

# **Wait, But How?**  

Understanding why neural networks are necessary requires looking at how real-world data behaves. Most data-driven problems are influenced by multiple factors, and the relationships between these factors are rarely straightforward.  

Consider the problem of predicting traffic congestion in a city. Traffic levels depend on several elements: time of day, nearby companies and factories, the number of vehicles on the road, and even weather conditions. A traditional rule-based system would require hardcoded conditions to define how each of these factors contributes to congestion. However, such an approach would be inefficient and difficult to maintain, as traffic patterns are dynamic and ever-changing.  

A neural network, on the other hand, can analyze historical traffic data and learn how different factors contribute to congestion. Instead of manually defining rules, we **feed the network with data**—such as the time, location, and previous congestion levels—and let it determine the underlying patterns. Over time, the network adjusts its internal parameters to make increasingly accurate predictions, even for conditions it has never encountered before.  

To achieve this, we break down the data into **features**, which represent different characteristics of the input. In the case of traffic prediction, features might include the number of cars on the road, the presence of traffic signals, or the weather. However, not all features contribute equally to the outcome—some have a more significant impact than others. For example, the number of nearby factories might have a greater influence on congestion during rush hour than during late-night hours.  

To account for these variations, the neural network **assigns different weights** to each feature, indicating its level of importance. These weighted inputs are then processed and combined in a structured manner, passing through layers of neurons until a final prediction is made.   

# **Feed Forward Propagation**  

In **Feed Forward Propagation**, data moves in one direction: from the input layer, through the hidden layers, and finally to the output layer. Each neuron in the network receives inputs, processes them using mathematical operations, and passes the result forward until the final prediction is made.  

At the heart of this process is the **weighted sum operation**, which determines how much influence each input has on a neuron’s activation. If we consider a single neuron, it takes an input vector $ X $, multiplies it with a corresponding weight vector $ W $, adds a bias term $B$, and applies an **activation function** to introduce non-linearity. Mathematically, this can be expressed as:  
$Z_i = \sum_{j} W_{ij} X_j + B_i$

where:  
- $ X_j $ represents the input features,  
- $ W_{ij} $ are the weights assigned to each input,  
- $B_i$ is the bias term that helps shift the activation threshold, and  
- $ Z_i $ is the result before applying an activation function.  
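As a rough illustration of the weighted sum, the sketch below computes $Z_i$ for two neurons at once with NumPy. The input, weight, and bias values are made up for the example:

```python
import numpy as np

# Hypothetical values: 3 input features feeding 2 neurons
X = np.array([0.5, -1.0, 2.0])       # input features X_j
W = np.array([[0.2, -0.4, 0.1],      # W[i, j]: weight from input j to neuron i
              [0.7, 0.3, -0.5]])
B = np.array([0.1, -0.2])            # one bias B_i per neuron

# Z_i = sum_j W_ij * X_j + B_i, computed for every neuron in one matrix product
Z = W @ X + B
print(Z)
```

Here `W` has one row per neuron, matching the $W_{ij}$ indexing in the formula. Storing the weights transposed (one column per neuron) and computing `X @ W + B` is an equally valid convention.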

Without an activation function, a neural network would behave like a simple linear model, limiting its ability to capture complex relationships in data. To introduce non-linearity, we apply an activation function to $ Z_i $, transforming it into an output $ A_i $:  
$A_i = f(Z_i)$

where $ f(Z) $ can be one of several activation functions, such as:  

- **Sigmoid**:  
  $  A = \frac{1}{1 + e^{-Z}} $
  Useful for probabilities but prone to saturation issues.  

- **ReLU (Rectified Linear Unit)**:  
  $  A = \max(0, Z)  $

  Commonly used in deep networks due to its ability to mitigate the vanishing gradient problem.  

This process continues across multiple layers, where the outputs of one layer serve as the inputs for the next. By the time data reaches the output layer, the network has mapped raw inputs to a meaningful prediction. However, the real challenge lies in determining the **optimal values of the weights and biases**, ensuring that the model accurately represents the underlying patterns in the data. This is where **Backpropagation** comes into play.  
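The layer-by-layer flow described above can be sketched as a short loop. This is a minimal illustration, assuming sigmoid activations in every layer and a hypothetical 2-3-1 architecture with random parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # weighted sum for this layer
        activations.append(sigmoid(z))   # output of this layer feeds the next
    return activations

# Hypothetical architecture: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.uniform(-0.5, 0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.uniform(-0.5, 0.5, size=n_out) for n_out in sizes[1:]]

activations = forward(np.array([1.0, 0.0]), weights, biases)
print(activations[-1])
```

Keeping every layer's activations, not just the final output, is useful because backpropagation needs them later.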

# **Backpropagation**  

Once the network makes a prediction, we must evaluate how far this prediction is from the actual expected result. The difference between the predicted and actual values is known as the **error**, which must be minimized for the model to improve. Backpropagation is the process through which the network **adjusts its weights and biases** to reduce this error over multiple training iterations.  

The first step in backpropagation is defining a **loss function**, which quantifies how incorrect the predictions are. Different loss functions are used based on the type of problem:  

- For **regression problems**, the **Mean Squared Error (MSE)** is commonly used:  
  $  L = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2  $
  
  where
  - $ Y_i $ is the actual value
  - $ \hat{Y}_i $ is the predicted value.  

- For **classification problems**, the **Cross-Entropy Loss** is preferred:  
  $  L = - \sum_{i=1}^{n} Y_i \log(\hat{Y}_i)  $

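  This loss penalizes confident but incorrect predictions heavily, because $ -\log(\hat{Y}_i) $ grows without bound as the predicted probability of the true class approaches zero.  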

Once the loss is calculated, the network must adjust its weights to minimize this value. The adjustment is guided by the **gradient descent** algorithm, which determines how much each weight contributes to the error and updates it accordingly. Using **partial derivatives**, we compute how changes in each weight affect the loss:  

  $  \frac{\partial L}{\partial W} = \frac{\partial L}{\partial A} \times \frac{\partial A}{\partial Z} \times \frac{\partial Z}{\partial W}  $

where:  
- $ \frac{\partial L}{\partial A} $ represents how the loss changes with respect to the neuron’s activation,  
- $ \frac{\partial A}{\partial Z} $ shows how the activation changes with respect to the weighted sum,  
- $ \frac{\partial Z}{\partial W} $ indicates how the weighted sum changes with respect to the weight.  
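One way to build confidence in the chain rule is to compare it against a numerical estimate. The sketch below does this for a single sigmoid neuron with a squared-error loss on one example. The specific inputs, target, and parameters are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(np.dot(w, x) + b)
    return (y - a) ** 2

# Arbitrary example values
x = np.array([1.0, 2.0])
y = 1.0
w = np.array([0.1, -0.3])
b = 0.05

# Chain rule: dL/dW = dL/dA * dA/dZ * dZ/dW
z = np.dot(w, x) + b
a = sigmoid(z)
dL_dA = -2.0 * (y - a)     # derivative of (y - a)^2 with respect to a
dA_dZ = a * (1.0 - a)      # derivative of the sigmoid
dZ_dW = x                  # derivative of w.x + b with respect to w
grad_analytic = dL_dA * dA_dZ * dZ_dW

# Central finite differences as an independent check
eps = 1e-6
grad_numeric = np.zeros_like(w)
for k in range(len(w)):
    step = np.zeros_like(w)
    step[k] = eps
    grad_numeric[k] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)

print(grad_analytic, grad_numeric)
```

The two gradients should agree to several decimal places. A large mismatch usually points to a mistake in one of the three factors.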

This formula follows the **chain rule** from calculus, allowing the error to be propagated **backward** through the network, layer by layer. After computing the gradients, we update the weights using the following rule:  

$ W_{\text{new}} = W_{\text{old}} - \eta \times \frac{\partial L}{\partial W} $

where $ \eta $ (the **learning rate**) controls how aggressively weights are adjusted. If the learning rate is too high, the model may overshoot optimal values; if too low, the training process becomes slow and inefficient.  

Backpropagation repeats over many training cycles (epochs), gradually refining the weights and biases until the loss stops decreasing. A low training loss does not by itself guarantee good performance on unseen inputs, but this iterative adjustment is what allows the neural network to learn patterns that can **generalize to new data**.  

# **Learning Rate and Loss Function**  

For a neural network to make accurate predictions, it must continuously adjust its parameters—specifically the **weights** and **biases**—to minimize errors. The process of adjusting these parameters relies on two fundamental concepts: the **loss function**, which measures how far off a prediction is, and the **learning rate**, which determines how aggressively the model updates its parameters.  

The **loss function** serves as a guide for the model’s improvement, quantifying the difference between the actual values and the predicted outputs. In simple terms, it answers the question: *How wrong was the model?* The goal of training is to minimize this loss over time. Different types of loss functions are used depending on the nature of the problem:  

- **For regression tasks**, where the output is a continuous value (such as predicting house prices), the **Mean Squared Error (MSE)** is commonly used:  
  $  L = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2  $  
  Here, $ Y_i $ represents the actual target value, $ \hat{Y}_i $ is the predicted value, and $ n $ is the number of training examples. The function penalizes larger errors more heavily, ensuring that the model focuses on reducing significant deviations.  

- **For binary classification problems**, such as determining whether an email is spam or not, **Binary Cross-Entropy** is preferred:  
  $  L = - \frac{1}{n} \sum_{i=1}^{n} \left[ Y_i \log(\hat{Y}_i) + (1 - Y_i) \log(1 - \hat{Y}_i) \right]  $  
  This function forces the model to assign high confidence only to correct predictions while heavily penalizing incorrect classifications.  

- **For multi-class classification problems**, such as recognizing digits in handwritten text, **Categorical Cross-Entropy** is used:  
  $  L = - \sum_{i=1}^{n} Y_i \log(\hat{Y}_i)  $  
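  Here, $ Y_i $ is 1 for the correct category and 0 otherwise, $ \hat{Y}_i $ is the predicted probability of category $ i $, and the sum runs over the categories for a single example (so $ n $ is the number of categories here, not the number of training examples). The function encourages the model to maximize the likelihood of the correct category.  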
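The three losses above can be written in a few lines of NumPy. This is a minimal sketch. The `eps` clipping is an added assumption to avoid `log(0)`, and the categorical version handles a single one-hot example:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so log() never receives exactly 0 or 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is a one-hot vector, y_pred a probability vector over the categories
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
print(categorical_cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.2, 0.7, 0.1])))
```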

Once the loss is calculated, the neural network must determine how to adjust its parameters to reduce this loss. This is where the **learning rate** comes into play.  

The **learning rate** $ \eta $ is a hyperparameter that dictates the step size in weight adjustments during gradient descent. If the learning rate is too **high**, the model makes large updates, potentially overshooting the optimal values, leading to instability. If it is too **low**, the model takes excessively small steps, slowing down learning and increasing computational cost.  

The weight update formula incorporating the learning rate is given by:  

$ W_{\text{new}} = W_{\text{old}} - \eta \times \frac{\partial L}{\partial W}$

where $ \frac{\partial L}{\partial W} $ represents the gradient of the loss function with respect to the weight. This process repeats over multiple iterations, allowing the network to gradually converge towards an optimal set of parameters.  
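The effect of the learning rate is easy to see on a toy problem. The sketch below applies the update rule to the one-dimensional loss $L(W) = W^2$, whose gradient is $2W$ and whose minimum is at $W = 0$. The toy loss and the three learning rates are assumptions chosen for illustration:

```python
def gradient(w):
    # Gradient of the toy loss L(w) = w^2
    return 2.0 * w

def run_gradient_descent(eta, w_start=1.0, steps=20):
    w = w_start
    for _ in range(steps):
        w = w - eta * gradient(w)   # W_new = W_old - eta * dL/dW
    return w

w_low = run_gradient_descent(eta=0.001)   # tiny steps: barely moves
w_good = run_gradient_descent(eta=0.1)    # moderate steps: approaches 0
w_high = run_gradient_descent(eta=1.1)    # overshoots more on every step
print(w_low, w_good, w_high)
```

With the smallest rate the weight is still close to its starting value after 20 steps, while with the largest rate each update overshoots the minimum and the weight grows in magnitude instead of shrinking.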

Choosing an appropriate learning rate is crucial. One common approach is to start with a moderate value and adjust it dynamically during training. Techniques such as **adaptive learning rates** (e.g., Adam, RMSprop) modify $ \eta $ automatically to improve efficiency.  
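As a rough sketch of what an adaptive method does, the function below implements the Adam update for a single scalar weight, using the commonly quoted defaults for `beta1`, `beta2`, and `eps`. The toy loss $L(W) = W^2$ and the choice of `eta=0.1` are assumptions for illustration only:

```python
import math

def adam_minimize(grad_fn, w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=200):
    m, v = 0.0, 0.0          # running averages of the gradient and squared gradient
    history = [w]
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return history

# Toy loss L(w) = w^2 with gradient 2w
history = adam_minimize(lambda w: 2.0 * w, w=1.0)
print(history[1], history[-1])
```

The step is scaled by the recent gradient magnitude, so the effective step size adapts per parameter instead of being fixed at $ \eta $.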

Ultimately, the interplay between the **loss function** and **learning rate** determines how well a neural network learns from its data. By minimizing loss efficiently, the model improves its ability to generalize, making accurate predictions even for unseen data.  


# **Activation Functions**  

Activation functions are essential components of neural networks, introducing **non-linearity** to the model. Without activation functions, a neural network would behave like a simple linear model, incapable of capturing complex patterns in data. These functions determine whether a neuron should "fire" or remain inactive based on the input it receives.  

Mathematically, an activation function takes the weighted sum of inputs, $ Z $, and transforms it into an output $ A $:  

$ A = f(Z) $

Several activation functions exist, each with its advantages and drawbacks:  

### **Sigmoid Function**  

One of the earliest activation functions used in neural networks, the **sigmoid function** squashes inputs into a range between 0 and 1, making it useful for probability-based predictions. It is defined as:  

$ \sigma(Z) = \frac{1}{1 + e^{-Z}} $

This function ensures that very large or very small values are mapped to a finite range, preventing extreme outputs. However, it suffers from a **vanishing gradient problem**, where gradients become too small during backpropagation, slowing down learning.  
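The vanishing gradient behaviour can be checked numerically. The sketch below evaluates the sigmoid's derivative, $\sigma(Z)(1 - \sigma(Z))$, at a few arbitrary points. The 10-layer figure is a simple upper bound that assumes every layer contributes at most the maximum derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(z)
print(grads)

# The derivative never exceeds 0.25, so a chain of 10 sigmoid layers
# scales the gradient by at most 0.25 ** 10 from activations alone
upper_bound_10_layers = 0.25 ** 10
print(upper_bound_10_layers)
```

The derivative peaks at $Z = 0$ and shrinks quickly as $|Z|$ grows, which is why saturated sigmoid neurons pass back very little gradient.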

### **Tanh (Hyperbolic Tangent) Function**  

Similar to the sigmoid function but centered around zero, the **tanh function** maps inputs between -1 and 1:  

$\tanh(Z) = \frac{e^Z - e^{-Z}}{e^Z + e^{-Z}}$

Because its outputs are centered around zero, activations stay balanced between positive and negative values, which tends to make optimization easier than with the sigmoid. However, it still faces the **vanishing gradient issue**, particularly in deeper architectures.  

### **ReLU (Rectified Linear Unit) Function**  

The **ReLU function** is one of the most widely used activation functions in deep learning due to its simplicity and effectiveness. It is defined as:  

$f(Z) = \max(0, Z)$

ReLU allows positive values to pass through unchanged while setting negative values to zero, introducing sparsity in neural networks. This sparsity improves computational efficiency, and because the gradient is exactly 1 for positive inputs, ReLU helps mitigate the vanishing gradient problem. However, ReLU suffers from the **dying ReLU problem**, where neurons can become inactive if they consistently receive negative inputs.  

### **Leaky ReLU and Parametric ReLU**  

To address the dying ReLU issue, variations such as **Leaky ReLU** introduce a small slope for negative inputs:  

$f(Z) = \max(\alpha Z, Z)$

where $ \alpha $ is a small positive constant (e.g., 0.01). A more flexible version, **Parametric ReLU (PReLU)**, allows $ \alpha $ to be learned during training.  
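The difference between ReLU and Leaky ReLU shows up most clearly in their gradients. The following sketch uses arbitrary inputs and the example value $\alpha = 0.01$. Treating the gradient at exactly $Z = 0$ as the negative-side value is a convention assumed here:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # 1 for positive inputs, alpha otherwise, so the gradient never becomes exactly 0
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```

A ReLU neuron whose inputs are always negative receives zero gradient and cannot recover, whereas the Leaky ReLU version still receives a small gradient of $\alpha$.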

### **Softmax Function**  

For multi-class classification tasks, the **Softmax function** is commonly used in the output layer. It converts raw scores into probabilities that sum to 1, making interpretation easier:  

$\text{Softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j} e^{Z_j}}$
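A minimal Softmax sketch in NumPy follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption. It does not change the result, because the shift cancels in the ratio. The scores are arbitrary:

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)      # avoids overflow in exp() for large scores
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())

# Large scores would overflow a naive implementation, but work with the shift
large = softmax(np.array([1000.0, 1000.0]))
print(large)
```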


In [None]:
import numpy as np

class NeuralNetwork():
    def __init__(self, input_size, hidden_nodes, output_size, learning_rate=0.1):
        """
        :param input_size: Number of input neurons
        :param hidden_nodes: List specifying number of neurons in each hidden layer
        :param output_size: Number of output neurons
        :param learning_rate: Learning rate for weight updates

        """
        self.input_size = input_size
        self.hidden_nodes = hidden_nodes  # List specifying neurons per hidden layer
        self.output_size = output_size
        self.learning_rate = learning_rate

        # Define the architecture: input layer → hidden layers → output layer
        layer_sizes = [input_size] + hidden_nodes + [output_size]

        # Initialize weights and biases dynamically
        self.weights = [np.random.rand(layer_sizes[i], layer_sizes[i+1]) - 0.5 for i in range(len(layer_sizes) - 1)]
        self.biases = [np.random.rand(layer_sizes[i+1]) - 0.5 for i in range(len(layer_sizes) - 1)]

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        # x is expected to be a sigmoid output (an activation), not the raw weighted sum
        return x * (1 - x)

    def feedForward(self, inputs):
        # Forward propagation through all layers.......
        self.layers = [inputs]  # Store activations of all layers
        for i in range(len(self.weights)):
            inputs = self.sigmoid(np.dot(inputs, self.weights[i]) + self.biases[i])
            self.layers.append(inputs)  # Save outputs of [i+1] layer for backpropagation
        return inputs


    # Backpropagation: update the weights and biases of the neural network
    def backpropagation(self, target_output):
        errors = [target_output - self.layers[-1]]  # Output layer error
        deltas = [errors[0] * self.sigmoid_derivative(self.layers[-1])]  # Output layer delta

        # Get deltas for each hidden layer in reverse order
        for i in range(len(self.hidden_nodes), 0, -1):
            errors.insert(0, np.dot(deltas[0], self.weights[i].T))  # Error of previous layer
            deltas.insert(0, errors[0] * self.sigmoid_derivative(self.layers[i]))  # Delta, previous layer

        # Update weights and biases
        for i in range(len(self.weights)):
            self.weights[i] += np.dot(self.layers[i].reshape(-1, 1), deltas[i].reshape(1, -1)) * self.learning_rate
            self.biases[i] += deltas[i] * self.learning_rate

    def train(self, X, y, epochs=10000):
        for epoch in range(epochs):
            total_loss = 0
            for i in range(len(X)):
                self.feedForward(X[i])
                self.backpropagation(y[i])
                total_loss += np.sum(np.abs(y[i] - self.layers[-1]))

            if epoch % 3000 == 0:
                print(f"Epoch {epoch}, Loss: {(total_loss / len(X)):.2f}")

    def predict(self, X):
        return [self.feedForward(x) for x in X]


    def save_model(self, filename):
        with open(filename, 'wb') as file:
            np.save(file, self.input_size)
            np.save(file, self.hidden_nodes)
            np.save(file, self.output_size)


            # Save all weights and biases
            for weight in self.weights:
                np.save(file, weight)

            for bias in self.biases:
                np.save(file, bias)


    def Load_Model(self, filename):
        # Read the values back in the same order save_model wrote them
        with open(filename, 'rb') as file:
            # np.load returns arrays, so convert the sizes back to plain Python types
            self.input_size = int(np.load(file))
            self.hidden_nodes = np.load(file).astype(int).tolist()
            self.output_size = int(np.load(file))
            # A network with N layers has N - 1 weight matrices and N - 1 bias vectors
            layer_sizes = [self.input_size] + self.hidden_nodes + [self.output_size]
            self.weights = [np.load(file) for _ in range(len(layer_sizes) - 1)]
            self.biases = [np.load(file) for _ in range(len(layer_sizes) - 1)]
            # Note: learning_rate is not stored in the file; the instance keeps its current value
            print(f'Model : {filename} has been Loaded successfully')


# OR dataset
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
#  Desired Output/ Target Output
y = np.array([
    [0],
    [1],
    [1],
    [1]
])

# Create a neural network with 2 inputs, one hidden layer of 20 nodes, and 1 output
nn = NeuralNetwork(input_size=2, hidden_nodes=[20], output_size=1, learning_rate=0.2)
nn.train(X, y, epochs=10000)

"""
The model is small and the learning rate is fairly high,
so 10,000 epochs are not much of a cost,
since the model only has to solve a basic problem

"""
nn.save_model('model.txt')


Epoch 0, Loss: 0.62
Epoch 3000, Loss: 0.02
Epoch 6000, Loss: 0.01
Epoch 9000, Loss: 0.01


In [None]:
nn.Load_Model('model.txt')