# Creating a Neural Network from Scratch in Python
## 1. Introduction
In this notebook, I will implement a simple neural network from scratch in Python, using only the standard library together with NumPy and Matplotlib.

Personal goal: create a neural network that can distinguish between pictures of dogs and cats.
#### Base Neural Network

<div>
<center><img src="attachment:2499c7cc-5fd5-49ec-a0fe-d10be876f7a6.png" width="650" align="center"/></center>
</div>

#### Binary Classification (Dog/Cat-Output Layer) based on Input Layer (Pixel Values of Dog and Cat Pictures)
<div>
<center><img src="attachment:dc8af289-8ffe-4191-8631-f54de932cbb2.png" width="650" align="center"/></center>
</div>
<div>
<center><img src="attachment:452f7f75-95e4-488c-8998-a9552d73620a.png" width="650" align="center"/></center>
</div>

In [1]:
import sys
import numpy as np
import matplotlib

## Canonical Depiction of Calculating a neuron in the second layer
![image.png](attachment:f57f04e1-22e7-4b29-ba29-786f3fc43b62.png)

In [4]:
# In this example: 3 input neurons, 3 weights, 1 bias and 1 output neuron
# Output = dot product + bias
inputs = [1,2,3]
weights = [0.2, 0.8, -0.5]
bias = 2

output = inputs[0]*weights[0] + inputs[1]*weights[1] + inputs[2]*weights[2] + bias
print(output)

2.3


### But what **can** be inputs? 
In a classical (feedforward) neural network, the **input nodes** represent individual numerical features of the data. These features must be **converted into numbers** before they can be fed into the network. Each input node corresponds to one numeric value from the input vector.

Common examples:

- **Numerical features** from structured/tabular data  
  - Example: age, income, temperature

- **Pixels of an image**  
  - Each input node corresponds to a pixel intensity (grayscale or color channels flattened)
  - Example: a 28x28 image = 784 input nodes

- **Encoded categorical variables**  
  - One-hot encoded or label encoded categories  
  - Example: gender, country, product type

- **Word representations** from text  
  - One-hot vectors, word embeddings, or frequency-based encodings (e.g., TF-IDF)  
  - Example: vectorized representation of a sentence

- **Time-series values**  
  - A fixed-size window of time steps as inputs  
  - Example: last 10 temperature readings → 10 input nodes

**Note:** Regardless of the data type (image, text, audio, etc.), all inputs must be transformed into a **1D numerical vector** before being fed into the network. Classical feedforward networks do not exploit spatial or sequential structure natively; such data is usually better handled by CNNs, RNNs, or Transformers.
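
As a rough illustration of the note above, the following sketch turns an image and a categorical feature into one flat input vector. The random image and the `countries` list are made-up placeholders, not data from this notebook.

```python
import numpy as np

# Hypothetical 28x28 grayscale image with pixel intensities in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flatten to 1D and scale to [0, 1]: one value per input node
pixel_inputs = image.reshape(-1) / 255.0
print(pixel_inputs.shape)  # (784,)

# One-hot encode a categorical feature (hypothetical category list)
countries = ["DE", "FR", "US"]
one_hot = np.zeros(len(countries))
one_hot[countries.index("FR")] = 1.0
print(one_hot)  # [0. 1. 0.]

# Different feature types can be concatenated into a single input vector
input_vector = np.concatenate([pixel_inputs, one_hot])
print(input_vector.shape)  # (787,)
```

Each entry of `input_vector` would then correspond to exactly one input node.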

### Another example: 4 input neurons and 3 output neurons
![image.png](attachment:9ca67aea-1044-4b5c-a9bc-05913c492c40.png)

In [12]:
# In this example: 4 input neurons and 3 output neurons,
# i.e. 3 sets of 4 weights and 3 biases (one set and one bias per output neuron)
# Output = dot product + bias
inputs = [1, 2, 3, 2.5]

weights1 = [0.2, 0.8, -0.5, 1.0]
weights2 = [0.5, -0.91, 0.26, -0.5]
weights3 = [-0.26, -0.27, 0.17, 0.87]

bias1 = 2
bias2 = 3
bias3 = 0.5

output = [inputs[0]*weights1[0] + inputs[1]*weights1[1] + inputs[2]*weights1[2] + inputs[3]*weights1[3] + bias1,
         inputs[0]*weights2[0] + inputs[1]*weights2[1] + inputs[2]*weights2[2] + inputs[3]*weights2[3] + bias2,
         inputs[0]*weights3[0] + inputs[1]*weights3[1] + inputs[2]*weights3[2] + inputs[3]*weights3[3] + bias3]
print(output)

# This can be written more compactly and efficiently, see below :)

[4.8, 1.21, 2.385]
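
Before moving on to NumPy, the same calculation can be sketched in plain Python with a loop over the neurons. This is only one possible way to write it; the names `layer_outputs`, `neuron_weights` and `neuron_bias` are illustrative.

```python
inputs = [1, 2, 3, 2.5]

# One row of weights and one bias per output neuron
weights = [[0.2, 0.8, -0.5, 1.0],
           [0.5, -0.91, 0.26, -0.5],
           [-0.26, -0.27, 0.17, 0.87]]
biases = [2, 3, 0.5]

layer_outputs = []
for neuron_weights, neuron_bias in zip(weights, biases):
    # Dot product of the inputs with this neuron's weights, plus its bias
    neuron_output = sum(x * w for x, w in zip(inputs, neuron_weights)) + neuron_bias
    layer_outputs.append(neuron_output)

print(layer_outputs)  # approximately [4.8, 1.21, 2.385]
```

The loop removes the repetition, but it still processes one multiplication at a time in Python.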


#### NumPy Shapes

General question: for each dimension, what is the size of that dimension?

Simple list:
<div>
<center><img src="attachment:e4a5eb70-00c3-47d5-9c82-31daa9bb8a66.png" width="650" align="center"/></center>
</div>

List of lists (all sublists at a given dimension must have the same length; otherwise the array has no well-defined **shape**):
<div>
<center><img src="attachment:26c323a3-097d-4723-b0e9-bbec5eadcbf0.png" width="650" align="center"/></center>
</div>

General List of Vectors:
<div>
<center><img src="attachment:4b51fff9-cb47-426a-96ca-126c8f75a99d.png" width="650" align="center"/></center>
</div>
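
The shapes described above can be checked directly with NumPy's `shape` attribute. The arrays below are small made-up examples.

```python
import numpy as np

# Simple list -> 1D array (vector)
vector = np.array([1, 2, 3, 2.5])
print(vector.shape)  # (4,)

# List of lists -> 2D array (matrix): 2 rows with 4 values each
matrix = np.array([[1, 2, 3, 2.5],
                   [2.0, 5.0, -1.0, 2.0]])
print(matrix.shape)  # (2, 4)

# List of lists of lists -> 3D array: 3 blocks, each with 2 rows of 4 values
tensor = np.array([[[1, 2, 3, 2.5], [2.0, 5.0, -1.0, 2.0]],
                   [[0.5, 1.0, 1.5, 2.0], [3.0, 3.0, 3.0, 3.0]],
                   [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]])
print(tensor.shape)  # (3, 2, 4)
```

Reading the shape from left to right corresponds to moving from the outermost brackets to the innermost ones.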

#### NumPy Dot Product
<div>
<center><img src="attachment:084dbd9b-8f2a-4329-8ef0-203bb92bd193.png" width="650" align="center"/></center>
</div>

In [17]:
inputs = [1, 2, 3, 2.5] #shape(4,)

weights = [[0.2, 0.8, -0.5, 1.0],
          [0.5, -0.91, 0.26, -0.5],
          [-0.26, -0.27, 0.17, 0.87]] #shape(3,4)

biases = [2, 3, 0.5]

# We want 3 output neurons, so weights (shape (3, 4)) must be the first argument:
# (3, 4) · (4,) -> (3,). With the arguments swapped, (4,) · (3, 4) raises a shape error.
output = np.dot(weights, inputs) + biases
print(output)

[4.8   1.21  2.385]
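
To make the argument order concrete, here is a small check of both orders. The flag `swapped_raises` is just an illustrative helper variable.

```python
import numpy as np

inputs = np.array([1, 2, 3, 2.5])                 # shape (4,)
weights = np.array([[0.2, 0.8, -0.5, 1.0],
                    [0.5, -0.91, 0.26, -0.5],
                    [-0.26, -0.27, 0.17, 0.87]])  # shape (3, 4)

# (3, 4) · (4,): the last axis of weights matches the length of inputs
print(np.dot(weights, inputs).shape)  # (3,)

# (4,) · (3, 4): the length of inputs (4) does not match the first axis of weights (3)
swapped_raises = False
try:
    np.dot(inputs, weights)
except ValueError:
    swapped_raises = True
print(swapped_raises)  # True
```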


<div>
<center><img src="attachment:214ec278-d60b-46b2-8539-4fbed7a13969.png" width="650" align="center"/></center>
</div>
<div>
<center><img src="attachment:996f35a7-4864-4a88-a48e-ae0c9ccf8baf.png" width="650" align="center"/></center>
</div>

## 2. Batches, Layers and Networks

Imagine we have 512 data samples, and we want a single neuron (e.g. in a linear regression) to fit a line through the data points.

If we update the model one sample at a time, each update fits only the current sample. The line shifts drastically at every step, overfitting to that single sample and largely ignoring the previous ones. We would end up drawing 512 different lines, each tailored to one sample instead of the broader trend in the data, and the resulting model would be unusable.

Instead of updating after every single sample, we can group samples into **batches** (e.g., batches of 4, 16, 32, 64, or 128), and compute the average gradient over each batch. This has two main benefits:
- Smoother and more stable updates: each weight update is influenced by multiple samples, reducing the effect of outliers.
- Computational efficiency: modern hardware (like GPUs) processes batches much faster than single samples.

Problem: if a batch contains all data samples at once, the model **can overfit** the training data and may struggle with new data.
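
The batching idea can be sketched for the single-neuron line fit mentioned above. This is a minimal sketch under several assumptions that are not part of this notebook: synthetic data around the line y = 3x + 1, mean squared error as the loss, plain gradient descent, and the chosen learning rate and batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 512 noisy samples around the line y = 3x + 1
x = rng.uniform(-1, 1, size=512)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, size=512)

w, b = 0.0, 0.0        # single neuron: prediction = w * x + b
learning_rate = 0.1    # assumed value
batch_size = 32        # assumed value

for epoch in range(50):
    order = rng.permutation(len(x))  # shuffle the samples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        # Gradients of the mean squared error, averaged over the batch
        grad_w = 2.0 * np.mean(error * x[idx])
        grad_b = 2.0 * np.mean(error)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(w, b)  # should end up close to 3 and 1
```

Because each update averages over 32 samples, a single outlier has only a small influence on the step.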

In [None]:
inputs = [[1, 2, 3, 2.5],
         [2.0, 5.0, -1.0, 2.0],
         [-1.5, 2.7, 3.3, -0.8]]

weights = [[0.2, 0.8, -0.5, 1.0],
          [0.5, -0.91, 0.26, -0.5],
          [-0.26, -0.27, 0.17, 0.87]]

biases = [2, 3, 0.5]

#### Matrix Multiplication
With a batch of samples, we no longer compute a single dot product but a matrix multiplication, which consists of individual row-column dot products:

<div>
<center><img src="attachment:caad1abe-7308-4e2b-93ff-05138046f01c.png" width="650" align="center"/></center>
</div>
<div>
<center><img src="attachment:a0990c1a-97e8-4c39-bcc6-9312b32cf405.png" width="650" align="center"/></center>
</div>
<div>
<center><img src="attachment:651d653e-c839-4472-ba17-b857e8c13f7d.png" width="650" align="center"/></center>
</div>
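
The row-column view of matrix multiplication can be verified with an explicit double loop. The matrices `a` and `b` reuse the numbers from this notebook (with `b` already arranged as columns); the loop is only for illustration and is far slower than NumPy's built-in product.

```python
import numpy as np

a = np.array([[1, 2, 3, 2.5],
              [2.0, 5.0, -1.0, 2.0],
              [-1.5, 2.7, 3.3, -0.8]])   # shape (3, 4)
b = np.array([[0.2, 0.5, -0.26],
              [0.8, -0.91, -0.27],
              [-0.5, 0.26, 0.17],
              [1.0, -0.5, 0.87]])        # shape (4, 3)

# Entry (i, j) of the product is the dot product of row i of a and column j of b
manual = np.zeros((a.shape[0], b.shape[1]))
for i in range(a.shape[0]):
    for j in range(b.shape[1]):
        manual[i, j] = np.dot(a[i, :], b[:, j])

print(np.allclose(manual, a @ b))  # True
```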

In [14]:
inputs = [[1, 2, 3, 2.5],
         [2.0, 5.0, -1.0, 2.0],
         [-1.5, 2.7, 3.3, -0.8]]

weights = [[0.2, 0.8, -0.5, 1.0],
          [0.5, -0.91, 0.26, -0.5],
          [-0.26, -0.27, 0.17, 0.87]]

biases = [2, 3, 0.5]

# output = np.dot(weights, inputs) + biases
# This would now raise a shape error, because inputs and weights both have shape (3, 4).
# The inner dimensions must match, i.e. (3, 4) · (4, 3), so we must transpose the weights matrix.

<div>
<center><img src="attachment:873f3460-5383-4d1f-a8c1-7314634e1eaf.png" width="650" align="center"/></center>
</div>

In [16]:
# np.dot converts the Python lists to NumPy arrays internally, but a plain list of lists has no .T attribute,
# so we convert weights to a NumPy array explicitly before transposing.
# The transpose is needed because np.dot pairs each row of the first matrix with each column of the second:
# each neuron's weight row must become a column so that it lines up with a sample's input row.
output = np.dot(inputs, np.array(weights).T) + biases  # output node values for each sample
print(output)

[[ 4.8    1.21   2.385]
 [ 8.9   -1.81   0.2  ]
 [ 1.41   1.051  0.026]]


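If we did not transpose, the sample [1, 2, 3, 2.5] would not be paired with the first neuron's weights [0.2, 0.8, -0.5, 1.0] but with the first column [0.2, 0.5, -0.26], which mixes weights from three different neurons and does not even have the right length.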
<div>
<center><img src="attachment:e9205fa9-9651-4cfe-a32d-79795aa826b2.png" width="650" align="center"/></center>
</div>
<div>
<center><img src="attachment:9c9f4d47-1606-4ddf-8f0f-dd599cac2d5a.png" width="650" align="center"/></center>
</div>

### Adding another Layer

In [18]:
inputs = [[1, 2, 3, 2.5],
         [2.0, 5.0, -1.0, 2.0],
         [-1.5, 2.7, 3.3, -0.8]]

weights1 = [[0.2, 0.8, -0.5, 1.0],
          [0.5, -0.91, 0.26, -0.5],
          [-0.26, -0.27, 0.17, 0.87]]

biases1 = [2, 3, 0.5]

# Output node values of layer 1; these are the input values for layer 2
layer1_outputs = np.dot(inputs, np.array(weights1).T) + biases1
# ---------------- Layer 2 ----------------
weights2 = [[0.1, -0.14, 0.5],
          [-0.5, 0.12, -0.33],
          [-0.44, 0.73, -0.13]]

biases2 = [-1, 2, -0.5]

layer2_outputs = np.dot(layer1_outputs, np.array(weights2).T) + biases2
print(layer2_outputs)


[[ 0.5031  -1.04185 -2.03875]
 [ 0.2434  -2.7332  -5.7633 ]
 [-0.99314  1.41254 -0.35655]]
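
One detail worth checking here: the second weight matrix is square (3 x 3), so forgetting the transpose would not raise a shape error. The sketch below reuses the layer 1 outputs printed earlier; the variable names `correct` and `no_transpose` are illustrative.

```python
import numpy as np

# Layer 1 outputs as printed earlier in this notebook
layer1_outputs = np.array([[4.8, 1.21, 2.385],
                           [8.9, -1.81, 0.2],
                           [1.41, 1.051, 0.026]])
weights2 = np.array([[0.1, -0.14, 0.5],
                     [-0.5, 0.12, -0.33],
                     [-0.44, 0.73, -0.13]])
biases2 = np.array([-1, 2, -0.5])

correct = np.dot(layer1_outputs, weights2.T) + biases2
no_transpose = np.dot(layer1_outputs, weights2) + biases2  # runs, but pairs inputs with the wrong weights

print(correct.shape == no_transpose.shape)  # True
print(np.allclose(correct, no_transpose))   # False
```

With square weight matrices the shapes cannot catch this mistake, so it helps to be explicit about which axis holds the neurons.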


In [21]:
# Convert the layer into a class to avoid repeating the same code for every layer.
# From here on, I use standard machine learning notation: inputs = "X"

X = [[1, 2, 3, 2.5],
     [2.0, 5.0, -1.0, 2.0],
     [-1.5, 2.7, 3.3, -0.8]]

# There are usually two ways to initialize a neural network:
# 1. Load a previously trained model (saved weights and biases).
# 2. Create a new network, whose weights and biases have to be initialized.
#    Here: small random weights (standard normal values scaled by 0.1) and biases of zero.
np.random.seed(0)  # fixed seed for reproducible random weights


class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # Weights: standard normal values scaled by 0.1 so that they start close to 0.
        # Shape (n_inputs, n_neurons) rather than the reverse, so the forward pass needs no transpose.
        self.weights = 0.1 * np.random.randn(n_inputs, n_neurons)
        # Biases: initialized to zero; np.zeros takes the shape as a tuple.
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        # inputs: the raw data for the first layer, the previous layer's output for later layers
        self.output = np.dot(inputs, self.weights) + self.biases

# Right after np.random.seed(0), print(np.random.randn(4, 3)) returns:
# [[ 1.76405235  0.40015721  0.97873798]
#  [ 2.2408932   1.86755799 -0.97727788]
#  [ 0.95008842 -0.15135721 -0.10321885]
#  [ 0.4105985   0.14404357  1.45427351]]

# Now we create two dense layers; layer 2 takes the 5 outputs of layer 1 as its inputs
layer1 = Layer_Dense(4, 5)
layer2 = Layer_Dense(5, 2)

layer1.forward(X)     # computes np.dot(X, weights) + biases and stores the result in layer1.output
print(layer1.output)  # shape (3, 5)

layer2.forward(layer1.output)
print(layer2.output)  # shape (3, 2)

[[ 0.10758131  1.03983522  0.24462411  0.31821498  0.18851053]
 [-0.08349796  0.70846411  0.00293357  0.44701525  0.36360538]
 [-0.50763245  0.55688422  0.07987797 -0.34889573  0.04553042]]
[[ 0.148296   -0.08397602]
 [ 0.14100315 -0.01340469]
 [ 0.20124979 -0.07290616]]
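
Two design choices in the layer class can be checked in isolation: the `(n_inputs, n_neurons)` weight layout and the `(1, n_neurons)` bias shape. The sketch below uses its own random data and the newer `np.random.default_rng` generator, so its numbers differ from the outputs above; `weights_rows` and `shifted` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))              # batch of 3 samples with 4 features each
weights = 0.1 * rng.normal(size=(4, 5))  # layout (n_inputs, n_neurons)
biases = np.zeros((1, 5))                # a single row, broadcast over the batch

output = np.dot(X, weights) + biases
print(output.shape)  # (3, 5)

# The neuron-per-row layout gives the same result, but needs a transpose in every forward pass
weights_rows = weights.T                 # layout (n_neurons, n_inputs)
print(np.allclose(output, np.dot(X, weights_rows.T) + biases))  # True

# Broadcasting: the single bias row is added to every sample row
biases = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
shifted = np.dot(X, weights) + biases
print(np.allclose(shifted - output, np.tile(biases, (3, 1))))  # True
```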
