## 1. Theoretical Introduction to Backpropagation

Backpropagation is the backbone of training for most modern neural networks. It is the algorithm that computes the gradient of the loss function with respect to every weight in the network; those gradients are then used, typically by gradient descent, to adjust the weights and reduce the error.

### Basics:
Given a feed-forward neural network, the first pass, in which the input flows forward through the network to produce an output, is known as **forward propagation**. The error, i.e. the difference between this output and the true value, is then calculated.

Backpropagation, short for "backward propagation of errors," involves:
1. Calculating the gradient of the loss function with respect to each weight by using the chain rule.
2. Updating the weights in the network in the opposite direction to the gradient. This means if a particular weight was responsible for a large portion of the error, it would be adjusted more than a weight that was only responsible for a small portion of the error.
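
To make the update step concrete, here is a minimal sketch of gradient descent on a single weight. The loss `L(w) = (w - 3)^2` and the names `loss_gradient` and `step_size` are made up for illustration; they are not part of the network trained later in this notebook.

```python
# Hypothetical one-weight problem: minimize L(w) = (w - 3)^2, whose minimum is at w = 3.
def loss_gradient(w):
    # dL/dw for L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0          # arbitrary starting weight
step_size = 0.1  # plays the role of the learning rate

for _ in range(100):
    # Move against the gradient: a larger gradient produces a larger adjustment.
    w -= step_size * loss_gradient(w)

print(w)
```

The further `w` is from the minimum, the larger the gradient and the bigger the step, which mirrors the idea that weights responsible for more of the error are adjusted more.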

A key thing to understand is that, as used here, backpropagation needs a known, desired output for each input value so that an error can be computed. This makes the training setup a form of supervised learning.

The basic idea revolves around how changing the weights impacts the overall error. The algorithm uses the chain rule of calculus to compute the error contribution of each weight.
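
As a small worked example of the chain rule, the sketch below computes the gradient for a single sigmoid neuron and compares it with a finite-difference estimate. The squared-error loss, the function names `loss` and `grad_w`, and the numeric values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, target):
    # Half squared error of a single sigmoid neuron
    a = sigmoid(w * x + b)
    return 0.5 * (a - target) ** 2

def grad_w(w, b, x, target):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    z = w * x + b
    a = sigmoid(z)
    dL_da = a - target
    da_dz = a * (1.0 - a)  # derivative of the sigmoid, written in terms of its output
    dz_dw = x
    return dL_da * da_dz * dz_dw

# Arbitrary illustrative values
w, b, x, target = 0.5, -0.2, 1.5, 1.0

analytic = grad_w(w, b, x, target)

# Central finite difference as an independent check
eps = 1e-6
numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)

print(analytic, numeric)
```

The two numbers should agree to many decimal places, which is a common way to sanity-check a hand-derived gradient.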

## 2. Dataset

In [None]:
# Importing necessary libraries
import numpy as np

# Simple dataset (XOR problem)
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

## 3. Model coded in Python

In [None]:
# Hyperparameters
input_size = 2
hidden_size = 2
output_size = 1
learning_rate = 0.5

# Initialize weights and biases
np.random.seed(0)
weights_input_hidden = np.random.rand(input_size, hidden_size)
weights_hidden_output = np.random.rand(hidden_size, output_size)
biases_hidden = np.random.rand(1, hidden_size)
biases_output = np.random.rand(1, output_size)

# Activation and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # x must already be a sigmoid output s = sigmoid(z); then s * (1 - s) equals sigmoid'(z)
    return x * (1 - x)

# Training the model with backpropagation
epochs = 10000
for epoch in range(epochs):
    # Forward propagation
    hidden_layer_input = np.dot(X, weights_input_hidden) + biases_hidden
    hidden_layer_output = sigmoid(hidden_layer_input)
    output_layer_input = np.dot(hidden_layer_output, weights_hidden_output) + biases_output
    predicted_output = sigmoid(output_layer_input)

    # Calculate the error
    error = y - predicted_output
    
    # Backpropagation
    d_predicted_output = error * sigmoid_derivative(predicted_output)
    
    error_hidden_layer = d_predicted_output.dot(weights_hidden_output.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)
    
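    # Update the weights and biases (error = y - prediction, so adding the deltas reduces the error)
    weights_hidden_output += hidden_layer_output.T.dot(d_predicted_output) * learning_rate
    biases_output += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate
    weights_input_hidden += X.T.dot(d_hidden_layer) * learning_rate
    biases_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate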
    
print(predicted_output)

## 4. Explanation

In the provided Python code:

1. **Forward Propagation**: After the weights and biases are initialized randomly, the inputs are passed through the network on every epoch to get the predicted output.

2. **Error Calculation**: The difference between the actual output (`y`) and the predicted output is calculated.

3. **Backpropagation**:
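    * The error is multiplied by the derivative of the activation function (in this case, sigmoid) to get the output-layer delta, which gives the direction and rate at which the weights need to change. Note that `sigmoid_derivative` is applied to values that have already passed through `sigmoid`, which is why it can be written as `x * (1 - x)`.
    * The delta is then propagated backward through `weights_hidden_output` and multiplied by the sigmoid derivative of the hidden activations to get the hidden-layer delta (in this simple network, there's only one hidden layer).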
    
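4. **Weight Update**: The weights are updated in the direction that reduces the error. Because `error` is defined as `y - predicted_output`, the computed deltas already point downhill, which is why the code adds them with `+=`. The `learning_rate` decides by how much we adjust the weights.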
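
The delta computations described in the steps above can be checked numerically. The following is a minimal sketch, not part of the original notebook: it rebuilds a 2-2-1 sigmoid network with shortened, illustrative names (`W1`, `b1`, `W2`, `b2`) and its own random initialization, assumes the loss is half the sum of squared errors, and compares the backpropagated step for the input-to-hidden weights against a central finite-difference estimate of the gradient.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Illustrative parameters (names and seed are assumptions for this sketch)
rng = np.random.default_rng(0)
W1, b1 = rng.random((2, 2)), rng.random((1, 2))
W2, b2 = rng.random((2, 1)), rng.random((1, 1))

def loss(W1, b1, W2, b2):
    # Assumed loss: half the sum of squared errors
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    return 0.5 * np.sum((y - out) ** 2)

# Backpropagation, following the same steps as the training loop
hidden = sigmoid(X @ W1 + b1)
out = sigmoid(hidden @ W2 + b2)
d_out = (y - out) * out * (1 - out)
d_hidden = (d_out @ W2.T) * hidden * (1 - hidden)
step_W1 = X.T @ d_hidden  # equals the negative gradient -dL/dW1

# Central finite differences for dL/dW1
eps = 1e-6
numeric_grad_W1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric_grad_W1[i, j] = (loss(W_plus, b1, W2, b2) - loss(W_minus, b1, W2, b2)) / (2 * eps)

# step_W1 is the negative gradient, so the sum should be close to zero
max_diff = np.max(np.abs(step_W1 + numeric_grad_W1))
print(max_diff)
```

A `max_diff` near zero indicates that the backpropagated step is the negative of the true gradient, so adding it to the weights moves them downhill on this loss.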

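After training for 10,000 epochs, our neural network should typically be able to approximate the XOR function. The printed `predicted_output` at the end of training should then be close to the XOR targets `[0, 1, 1, 0]` for the four inputs. With only two hidden units, training can occasionally stall depending on the random initialization, so it is worth inspecting the printed values instead of assuming convergence.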
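
One way to make "close to the targets" concrete is to threshold the sigmoid outputs at 0.5 and compare the resulting labels with `y`. The sketch below uses made-up prediction values: `example_predictions` is hypothetical and is not output from the training run above.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])

# Hypothetical sigmoid outputs, standing in for `predicted_output`
example_predictions = np.array([[0.03], [0.96], [0.97], [0.04]])

# Threshold at 0.5 to turn probabilities into 0/1 labels
predicted_labels = (example_predictions > 0.5).astype(int)

accuracy = np.mean(predicted_labels == y)
mean_squared_error = np.mean((y - example_predictions) ** 2)

print(predicted_labels.ravel(), accuracy, mean_squared_error)
```

The same three lines can be applied to the real `predicted_output` array after training to see whether all four XOR cases are classified correctly.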

The essence of backpropagation is captured in how we calculate the error at each layer and adjust the weights accordingly. By repeatedly adjusting the weights using the gradients, we aim to minimize the error, allowing the network to learn the patterns in the data.
