# Introduction to Deep Learning with PyTorch

In this notebook, you'll get introduced to [PyTorch](http://pytorch.org/), a framework for building and training neural networks. In a lot of ways, PyTorch tensors behave like the arrays you love from Numpy. These Numpy arrays, after all, are just tensors. PyTorch takes these tensors and makes it simple to move them to GPUs for the faster processing needed when training neural networks. It also provides a module that automatically calculates gradients (for backpropagation!) and another module specifically for building neural networks. All together, PyTorch ends up being more coherent with Python and the Numpy/Scipy stack compared to TensorFlow and other frameworks.



## Neural Networks

Deep Learning is based on artificial neural networks which have been around in some form since the late 1950s. The networks are built from individual parts approximating neurons, typically called units or simply "neurons." Each unit has some number of weighted inputs. These weighted inputs are summed together (a linear combination) then passed through an activation function to get the unit's output.

<img src="../../../images/AI_Programming_with_Python_ND_P2_L_01.png" width=400px>

Mathematically this looks like: 

$$
\begin{align}
y &= f(w_1 x_1 + w_2 x_2 + b) \\
y &= f\left(\sum_i w_i x_i +b \right)
\end{align}
$$

With vectors this is the dot/inner product of two vectors:

$$
h = \begin{bmatrix}
x_1 \, x_2 \cdots  x_n
\end{bmatrix}
\cdot 
\begin{bmatrix}
           w_1 \\
           w_2 \\
           \vdots \\
           w_n
\end{bmatrix}
$$
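
As a quick check of the formulas above, here is a minimal sketch that evaluates a single unit both as an explicit sum and as a dot product. The values of `x`, `w`, and `b` are made up for illustration, and `torch.sigmoid` stands in for the activation function $f$.

```python
import torch

# Illustrative values only; not taken from the notebook's data
x = torch.tensor([1.0, 2.0, 3.0])    # inputs x_1..x_n
w = torch.tensor([0.5, -1.0, 0.25])  # weights w_1..w_n
b = torch.tensor(0.1)                # bias

# Linear combination written as an explicit sum of products
h_sum = (w * x).sum() + b
# The same quantity written as a dot (inner) product
h_dot = torch.dot(x, w) + b

# Pass the linear combination through an activation function (sigmoid here)
y = torch.sigmoid(h_dot)

print(h_sum, h_dot, y)
```

Both forms give the same number; the dot product is simply the more compact way to write the sum.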

## Tensors

It turns out neural network computations are just a bunch of linear algebra operations on *tensors*, a generalization of matrices. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and an array with three indices is a 3-dimensional tensor (RGB color images, for example). The tensor is the fundamental data structure for neural networks, and PyTorch (as well as pretty much every other deep learning framework) is built around it.
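
To make the dimension counting concrete, the sketch below builds one tensor of each kind and prints its number of dimensions and shape. The sizes (3, 4, and a 32×32 image with 3 color channels) are arbitrary choices for illustration.

```python
import torch

vector = torch.zeros(3)         # 1-dimensional tensor
matrix = torch.zeros(3, 4)      # 2-dimensional tensor
image = torch.zeros(3, 32, 32)  # 3-dimensional tensor, e.g. channels x height x width

for name, t in [("vector", vector), ("matrix", matrix), ("image", image)]:
    # .dim() is the number of indices needed to address one element
    print(name, t.dim(), tuple(t.shape))
```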

<img src="../../../images/AI_Programming_with_Python_ND_P2_L_02.svg" width=600px>

With the basics covered, it's time to explore how we can use PyTorch to build a simple neural network.

In [2]:
import torch

torch.manual_seed(7)

def activation(x):
    """ Sigmoid activation function 
    
        Arguments
        ---------
        x: torch.Tensor
    """
    return 1/(1+torch.exp(-x))
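
As a sanity check on the hand-written sigmoid, the sketch below compares it with PyTorch's built-in `torch.sigmoid` on a few evenly spaced inputs. The function is redefined here only so that the block runs on its own; the test inputs are arbitrary.

```python
import torch

def activation(x):
    """Sigmoid written out by hand, as in the cell above."""
    return 1 / (1 + torch.exp(-x))

x = torch.linspace(-5, 5, steps=11)

# The hand-written version should agree with the built-in one
print(torch.allclose(activation(x), torch.sigmoid(x)))

# Sigmoid maps 0 to 0.5 and squashes every input into the range (0, 1)
print(activation(torch.tensor(0.0)))
```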

In [4]:
# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))

print("Features:", features)
print("Weights:", weights)
print("Bias:", bias)

Features: tensor([[0.1328, 0.1373, 0.2405, 1.3955, 1.3470]])
Weights: tensor([[2.4382, 0.2028, 2.4505, 2.0256, 1.7792]])
Bias: tensor([[-0.9179]])


Above I generated data we can use to get the output of our simple network. This is all just random for now; going forward we'll start using real data. Going through each relevant line:

`features = torch.randn((1, 5))` creates a tensor with shape `(1, 5)`, one row and five columns, that contains values randomly distributed according to the normal distribution with a mean of zero and standard deviation of one. 

`weights = torch.randn_like(features)` creates another tensor with the same shape as `features`, again containing values from a normal distribution.

Finally, `bias = torch.randn((1, 1))` creates a single value from a normal distribution.
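
The shapes described above can be checked directly. The sketch below repeats the three calls and also draws a large sample (100,000 values, an arbitrary choice) to show that `torch.randn` produces values with a mean near zero and a standard deviation near one.

```python
import torch

features = torch.randn((1, 5))        # one row, five columns
weights = torch.randn_like(features)  # same shape (and dtype) as features
bias = torch.randn((1, 1))            # a single value stored as a 1x1 tensor

print(features.shape, weights.shape, bias.shape)

# With many samples, the empirical mean and standard deviation approach 0 and 1
samples = torch.randn(100_000)
print(samples.mean().item(), samples.std().item())
```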

PyTorch tensors can be added, multiplied, subtracted, etc., just like Numpy arrays. In general, you'll use PyTorch tensors pretty much the same way you'd use Numpy arrays. They come with some nice benefits, though, such as GPU acceleration, which we'll get to later. For now, use the generated data to calculate the output of this simple single-layer network.

> **Exercise**: Calculate the output of the network with input features `features`, weights `weights`, and bias `bias`. Similar to Numpy, PyTorch has a [`torch.sum()`](https://pytorch.org/docs/stable/torch.html#torch.sum) function, as well as a `.sum()` method on tensors, for taking sums. Use the function `activation` defined above as the activation function.

In [5]:
### Solution

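# Now, make our labels from our data and true weights.
# The two lines below are equivalent: multiply elementwise, sum, add the bias, apply the activation.

# Using the torch.sum() function
y = activation(torch.sum(features * weights) + bias)
# Using the .sum() method on the tensor
y = activation((features * weights).sum() + bias)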

`torch.mm()` and `torch.matmul()` are both PyTorch functions for matrix operations, but they have important differences:

## torch.mm()

**Purpose**: Performs matrix multiplication specifically for 2D tensors (matrices).

**Requirements**:
- Both inputs must be exactly 2D tensors
- Inner dimensions must match: `(n × m) @ (m × p) → (n × p)`

**Example**:
```python
import torch

A = torch.tensor([[1, 2], [3, 4]])  # 2x2 matrix
B = torch.tensor([[5, 6], [7, 8]])  # 2x2 matrix

result = torch.mm(A, B)
# Result: [[19, 22], [43, 50]]
```

**Limitations**:
- Cannot handle broadcasting
- Only works with exactly 2D inputs
- Raises a `RuntimeError` if either input is not 2D or if the inner dimensions do not match

## torch.matmul()

**Purpose**: General matrix multiplication with broadcasting support.

**Capabilities**:
- Handles various tensor dimensions (1D, 2D, 3D+)
- Supports batch operations
- Includes broadcasting rules

**Behavior by dimension**:
- **1D × 1D**: Dot product → scalar
- **2D × 2D**: Standard matrix multiplication (same as `torch.mm()`)
- **1D × 2D**: Vector-matrix multiplication
- **2D × 1D**: Matrix-vector multiplication
- **3D+**: Batch matrix multiplication

**Examples**:
```python
# 2D case (equivalent to torch.mm)
A = torch.tensor([[1, 2], [3, 4]])
B = torch.tensor([[5, 6], [7, 8]])
result = torch.matmul(A, B)  # Same as torch.mm(A, B)

# 1D vector dot product
v1 = torch.tensor([1, 2, 3])
v2 = torch.tensor([4, 5, 6])
dot_product = torch.matmul(v1, v2)  # Result: 32

# Batch matrix multiplication
batch_A = torch.randn(10, 3, 4)  # 10 matrices of size 3×4
batch_B = torch.randn(10, 4, 5)  # 10 matrices of size 4×5
batch_result = torch.matmul(batch_A, batch_B)  # Shape: (10, 3, 5)

# Broadcasting example
A = torch.randn(5, 3, 4)     # 5 matrices of 3×4
B = torch.randn(4, 2)        # Single 4×2 matrix
result = torch.matmul(A, B)  # B broadcasts to (5, 4, 2), result: (5, 3, 2)
```

## Key Differences Summary

| Feature | torch.mm() | torch.matmul() |
|---------|------------|----------------|
| **Input dimensions** | Exactly 2D only | 1D, 2D, 3D+ |
| **Broadcasting** | No | Yes |
| **Batch operations** | No | Yes |
| **Performance** | Marginally less overhead for 2D inputs (usually negligible) | More versatile |
| **Error handling** | Strict dimension requirements | Flexible with broadcasting |

## Practical Recommendations

**Use `torch.mm()`** when:
- You're certain both inputs are 2D matrices
- You want strict dimension checking
- Performance is critical for simple matrix multiplication

**Use `torch.matmul()`** when:
- Working with various tensor dimensions
- Need broadcasting capabilities
- Handling batch operations
- Want a general-purpose matrix multiplication function

In most modern PyTorch code, `torch.matmul()` is preferred due to its flexibility and broader applicability, especially in deep learning where batch operations and various tensor shapes are common.

---

You can do the multiplication and sum in the same operation using a matrix multiplication. In general, you'll want to use matrix multiplications since they are more efficient and accelerated using modern libraries and high-performance computing on GPUs.

Here, we want to do a matrix multiplication of the features and the weights. For this we can use [`torch.mm()`](https://pytorch.org/docs/stable/torch.html#torch.mm) or [`torch.matmul()`](https://pytorch.org/docs/stable/torch.html#torch.matmul), which is somewhat more complicated and supports broadcasting. If we try to do it with `features` and `weights` as they are, we'll get an error:

```python
>> torch.mm(features, weights)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-15d592eb5279> in <module>()
----> 1 torch.mm(features, weights)

RuntimeError: size mismatch, m1: [1 x 5], m2: [1 x 5] at /Users/soumith/minicondabuild3/conda-bld/pytorch_1524590658547/work/aten/src/TH/generic/THTensorMath.c:2033
```

As you're building neural networks in any framework, you'll see this often. Really often. What's happening here is that our tensors aren't the correct shapes to perform a matrix multiplication. Remember that for matrix multiplications, the number of columns in the first tensor must equal the number of rows in the second tensor. Both `features` and `weights` have the same shape, `(1, 5)`. This means we need to change the shape of `weights` to get the matrix multiplication to work.

**Note:** To see the shape of a tensor called `tensor`, use `tensor.shape`. If you're building neural networks, you'll be using this attribute often.
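
The sketch below reproduces the shape mismatch in a self-contained form and catches the exception instead of letting it stop the program. The wording of the error message differs between PyTorch versions, so it is printed rather than assumed.

```python
import torch

features = torch.randn((1, 5))
weights = torch.randn_like(features)

# Inspect the shapes first: both are torch.Size([1, 5])
print(features.shape, weights.shape)

# (1 x 5) times (1 x 5) is invalid: 5 columns on the left, but only 1 row on the right
try:
    torch.mm(features, weights)
    mm_failed = False
except RuntimeError as err:
    mm_failed = True
    print("RuntimeError:", err)
```

Printing `.shape` before a failing multiplication is usually the fastest way to find which operand needs reshaping.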

There are a few options here: [`weights.reshape()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.reshape), [`weights.resize_()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.resize_), and [`weights.view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view).

* `weights.reshape(a, b)` returns a tensor of size `(a, b)` containing the same values as `weights`. When possible it is a view that shares the original memory; otherwise it is a clone, meaning the data is copied to another part of memory.
* `weights.resize_(a, b)` returns the same tensor with a different shape. However, if the new shape results in fewer elements than the original tensor, some elements will be removed from the tensor (but not from memory). If the new shape results in more elements than the original tensor, new elements will be uninitialized in memory. Here I should note that the underscore at the end of the method denotes that this method is performed **in-place**. Here is a great forum thread to [read more about in-place operations](https://discuss.pytorch.org/t/what-is-in-place-operation/16244) in PyTorch.
* `weights.view(a, b)` returns a new tensor of size `(a, b)` that shares the same data as `weights`. It raises an error if the requested shape does not match the number of elements in `weights`.

I usually use `.view()`, but any of the three methods will work for this. So, now we can reshape `weights` to have five rows and one column with something like `weights.view(5, 1)`.
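
To illustrate what "the same data" means in the list above, here is a small sketch with stand-in values (0 through 4, not the notebook's random weights). It checks that a view shares storage with the original tensor, so a write through the view shows up in the original.

```python
import torch

weights = torch.arange(5.0).reshape(1, 5)  # stand-in values, shape (1, 5)

viewed = weights.view(5, 1)
# A view points at the same underlying memory as the original tensor
print(viewed.data_ptr() == weights.data_ptr())

# Writing through the view is therefore visible in the original
viewed[0, 0] = 100.0
print(weights)

# reshape() returns a view when it can, so here it also shares storage
reshaped = weights.reshape(5, 1)
print(reshaped.data_ptr() == weights.data_ptr())
```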

> **Exercise**: Calculate the output of our little network using matrix multiplication.

---

The `.view()` method in PyTorch reshapes a tensor without changing its data. `weights.view(5, 1)` specifically reshapes the `weights` tensor into a 5×1 column vector.

## How .view() Works

**.view()** changes the tensor's shape while keeping the same data and total number of elements. It's similar to NumPy's `.reshape()`.

## Example with weights.view(5,1)

```python
import torch

# Original weights tensor (1D)
weights = torch.tensor([0.2, 0.5, 0.1, 0.8, 0.3])
print("Original shape:", weights.shape)  # torch.Size([5])
print("Original tensor:", weights)

# Reshape to column vector
weights_col = weights.view(5, 1)
print("New shape:", weights_col.shape)  # torch.Size([5, 1])
print("Reshaped tensor:")
print(weights_col)
```

**Output:**
```
Original shape: torch.Size([5])
Original tensor: tensor([0.2000, 0.5000, 0.1000, 0.8000, 0.3000])
New shape: torch.Size([5, 1])
Reshaped tensor:
tensor([[0.2000],
        [0.5000],
        [0.1000],
        [0.8000],
        [0.3000]])
```

## Why This Matters

**Broadcasting and Matrix Operations**: Converting to a column vector enables proper broadcasting in matrix operations:

```python
# Without .view() - 1D tensor
data = torch.randn(3, 5)  # 3 samples, 5 features
weights = torch.tensor([0.2, 0.5, 0.1, 0.8, 0.3])

# Elementwise multiply: the 1D weights broadcast across each row of data
result1 = data * weights  # Shape: (3, 5)

# With .view() - explicit column vector
weights_col = weights.view(5, 1)

# Matrix multiplication performs the multiply and the sum in one step
result2 = torch.matmul(data, weights_col)  # Shape: (3, 1)
```

## Common Use Cases

**1. Matrix Multiplication:**
```python
X = torch.randn(100, 5)  # 100 samples, 5 features
w = torch.randn(5)       # weights

# A column vector keeps the output 2D: (100, 1) instead of (100,)
w_col = w.view(5, 1)
output = torch.matmul(X, w_col)  # Shape: (100, 1)
```

**2. Neural Network Layers:**
```python
# Converting 1D bias to proper shape for addition
bias = torch.randn(10)
bias_reshaped = bias.view(1, 10)  # Row vector for broadcasting
```

## Key Properties of .view()

- **Memory sharing**: The new tensor shares memory with the original
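- **Contiguous requirement**: The requested shape must be compatible with the tensor's memory layout, which in practice usually means the tensor is contiguous; otherwise call `.contiguous()` first or use `.reshape()`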
- **Element count preservation**: Total number of elements must remain the same

```python
weights = torch.tensor([1, 2, 3, 4, 5, 6])
# Valid reshapes (6 elements total):
weights.view(2, 3)  # 2×3 matrix
weights.view(3, 2)  # 3×2 matrix  
weights.view(6, 1)  # 6×1 column vector
weights.view(1, 6)  # 1×6 row vector

# Invalid reshape:
# weights.view(2, 4)  # Error: 2×4 = 8 ≠ 6 elements
```
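
The contiguity requirement listed above is easiest to see on a transposed tensor. The sketch below uses a small 2×3 tensor (an arbitrary example) and shows `.view()` failing on its transpose, followed by the two usual workarounds.

```python
import torch

m = torch.arange(6).view(2, 3)
t = m.t()  # transpose: shape (3, 2), no longer contiguous in memory
print(t.is_contiguous())

# view() cannot reinterpret this memory layout as a flat tensor
try:
    t.view(6)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("RuntimeError:", err)

# Either make a contiguous copy first, or let reshape() copy when needed
flat_a = t.contiguous().view(6)
flat_b = t.reshape(6)
print(flat_a, flat_b)
```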

The `.view(5,1)` operation is commonly used when you need to ensure a tensor has the proper dimensionality for matrix operations or broadcasting, converting a 1D tensor into an explicit column vector format.

---

In [7]:
## Solution

y = activation(torch.mm(features, weights.view(5,1)) + bias)
print(y)


tensor([[0.1595]])


## Stack them up!

That's how you can calculate the output for a single neuron. The real power of this algorithm happens when you start stacking these individual units into layers and stacks of layers, into a network of neurons. The output of one layer of neurons becomes the input for the next layer. With multiple input units and output units, we now need to express the weights as a matrix.

<img src='../../../images/AI_Programming_with_Python_ND_P2_L_03.png' width=450px>

The first layer, shown on the bottom here, holds the inputs and is understandably called the **input layer**. The middle layer is called the **hidden layer**, and the final layer (on the right) is the **output layer**. We can express this network mathematically with matrices again and use matrix multiplication to get linear combinations for each unit in one operation. For example, the hidden layer ($h_1$ and $h_2$ here) can be calculated as:

$$
\vec{h} = [h_1 \, h_2] = 
\begin{bmatrix}
x_1 \, x_2 \cdots \, x_n
\end{bmatrix}
\cdot 
\begin{bmatrix}
           w_{11} & w_{12} \\
           w_{21} &w_{22} \\
           \vdots &\vdots \\
           w_{n1} &w_{n2}
\end{bmatrix}
$$

The output for this small network is found by treating the hidden layer as inputs for the output unit. The network output is expressed simply as:

$$
y =  f_2 \! \left(\, f_1 \! \left(\vec{x} \, \mathbf{W_1}\right) \mathbf{W_2} \right)
$$

---

The diagram below sketches the architecture of the network built in the code that follows:

```mermaid
graph LR
    %% Input Layer
    X1["x₁"] --> H1["h₁"]
    X1 --> H2["h₂"]
    X2["x₂"] --> H1
    X2 --> H2
    X3["x₃"] --> H1
    X3 --> H2
    
    %% Hidden to Output
    H1 --> O1["output"]
    H2 --> O1
    
    %% Bias nodes
    B1_node["bias₁"] --> H1
    B1_node --> H2
    B2_node["bias₂"] --> O1
    
    %% Layer labels
    subgraph "Input Layer (3 features)"
        X1
        X2
        X3
    end
    
    subgraph "Hidden Layer (2 units)"
        H1
        H2
        B1_node
    end
    
    subgraph "Output Layer (1 unit)"
        O1
        B2_node
    end
    
    style X1 fill:#BBDEFB
    style X2 fill:#BBDEFB
    style X3 fill:#BBDEFB
    style H1 fill:#E8F5E8
    style H2 fill:#E8F5E8
    style O1 fill:#FFE0B2
    style B1_node fill:#F3E5F5
    style B2_node fill:#F3E5F5
```

**Network Architecture Details:**

**Dimensions and Flow:**
- **Input Layer**: 3 features → `features.shape = (1, 3)`
- **Hidden Layer**: 2 neurons → `n_hidden = 2`
- **Output Layer**: 1 neuron → `n_output = 1`

**Weight Matrices:**
- **W1**: `(3, 2)` - connects input to hidden layer
- **W2**: `(2, 1)` - connects hidden to output layer

**Bias Vectors:**
- **B1**: `(1, 2)` - bias for hidden layer
- **B2**: `(1, 1)` - bias for output layer

**Mathematical Operations:**

```python
# Forward pass, using the sigmoid `activation` function defined earlier:
hidden_input = torch.matmul(features, W1) + B1    # Shape: (1, 2)
hidden_output = activation(hidden_input)          # Shape: (1, 2)

output_input = torch.matmul(hidden_output, W2) + B2  # Shape: (1, 1)
final_output = activation(output_input)              # Shape: (1, 1)
```

**Key Observations:**
- This is a simple feedforward network with one hidden layer
- The bias terms are shaped as row vectors `(1, n)` to broadcast correctly with batch operations
- Each connection represents learnable parameters that will be updated during training
- The network can handle batch processing since the first dimension can accommodate multiple samples
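
The last observation can be checked with a small sketch. It assumes a batch of 4 samples and freshly drawn random weights (neither comes from the notebook), and it redefines the sigmoid `activation` so the block runs on its own.

```python
import torch

def activation(x):
    """Sigmoid activation, matching the function defined earlier."""
    return 1 / (1 + torch.exp(-x))

torch.manual_seed(0)  # arbitrary seed, only for repeatability

batch = torch.randn(4, 3)  # 4 samples, 3 features each (batch size is an assumption)
W1 = torch.randn(3, 2)
W2 = torch.randn(2, 1)
B1 = torch.randn(1, 2)     # broadcasts across all 4 rows
B2 = torch.randn(1, 1)

h = activation(torch.matmul(batch, W1) + B1)  # shape (4, 2)
out = activation(torch.matmul(h, W2) + B2)    # shape (4, 1)
print(h.shape, out.shape)
```

Each row of `out` is the network output for the matching row of `batch`; the `(1, n)` biases are added to every row through broadcasting.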

---

In [10]:
### Generate some data
torch.manual_seed(7) # Set the random seed so things are predictable

# Features are 3 random normal variables
features = torch.randn((1, 3))

# Define the size of each layer in our network
n_input = features.shape[1]     # Number of input units, must match number of input features
n_hidden = 2                    # Number of hidden units 
n_output = 1                    # Number of output units

# Weights for inputs to hidden layer
W1 = torch.randn(n_input, n_hidden)
# Weights for hidden layer to output layer
W2 = torch.randn(n_hidden, n_output)

# and bias terms for hidden and output layers
B1 = torch.randn((1, n_hidden))
B2 = torch.randn((1, n_output))


print("features: ", features)
print("n_input: ", n_input)
print("n_hidden: ", n_hidden)
print("n_output: ", n_output)
print("\n")
print("=====================")
print("\n")
print("W1: ", W1)
print("B1: ", B1)
print("W2: ", W2)
print("B2: ", B2)

features:  tensor([[-0.1468,  0.7861,  0.9468]])
n_input:  3
n_hidden:  2
n_output:  1


=====================


W1:  tensor([[-1.1143,  1.6908],
        [-0.8948, -0.3556],
        [ 1.2324,  0.1382]])
B1:  tensor([[0.1328, 0.1373]])
W2:  tensor([[-1.6822],
        [ 0.3177]])
B2:  tensor([[0.2405]])


> **Exercise:** Calculate the output for this multi-layer network using the weights `W1` & `W2`, and the biases, `B1` & `B2`. 

In [11]:
### Solution

h = activation(torch.mm(features, W1) + B1)
output = activation(torch.mm(h, W2) + B2)
print(output)

tensor([[0.3171]])


If you did this correctly, you should see the output `tensor([[0.3171]])`.

The number of hidden units is a parameter of the network, often called a **hyperparameter** to differentiate it from the weight and bias parameters. As you'll see later when we discuss training a neural network, the more hidden units a network has, and the more layers, the better able it is to learn from data and make accurate predictions.
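
One way to see the difference between the hyperparameter and the learned parameters is to count how many weights and biases a given choice of `n_hidden` produces. The helper `count_parameters` below is hypothetical, written only for this illustration, and it assumes a network with a single hidden layer like the one above.

```python
def count_parameters(n_input, n_hidden, n_output):
    """Number of weights and biases in a network with one hidden layer."""
    weights = n_input * n_hidden + n_hidden * n_output
    biases = n_hidden + n_output
    return weights + biases

# The hidden sizes tried here are arbitrary
for n_hidden in (2, 8, 32):
    print(n_hidden, count_parameters(3, n_hidden, 1))
```

For the network in this notebook (3 inputs, 2 hidden units, 1 output), the count is the 6 entries of `W1`, the 2 entries of `W2`, and the 3 bias values.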

## Numpy to Torch and back

Special bonus section! PyTorch has a great feature for converting between Numpy arrays and Torch tensors. To create a tensor from a Numpy array, use `torch.from_numpy()`. To convert a tensor to a Numpy array, use the `.numpy()` method.

In [16]:
import numpy as np
a = np.random.rand(4,3)
a

array([[ 0.33669496,  0.59531562,  0.65433944],
       [ 0.86531224,  0.59945364,  0.28043973],
       [ 0.48409303,  0.98357622,  0.33884284],
       [ 0.25591391,  0.51081783,  0.39986403]])

In [17]:
b = torch.from_numpy(a)
b


 0.3367  0.5953  0.6543
 0.8653  0.5995  0.2804
 0.4841  0.9836  0.3388
 0.2559  0.5108  0.3999
[torch.DoubleTensor of size 4x3]

In [18]:
b.numpy()

array([[ 0.33669496,  0.59531562,  0.65433944],
       [ 0.86531224,  0.59945364,  0.28043973],
       [ 0.48409303,  0.98357622,  0.33884284],
       [ 0.25591391,  0.51081783,  0.39986403]])

The memory is shared between the Numpy array and the Torch tensor, so if you change the values of one object in place, the other will change as well.

In [19]:
# Multiply PyTorch Tensor by 2, in place
b.mul_(2)


 0.6734  1.1906  1.3087
 1.7306  1.1989  0.5609
 0.9682  1.9672  0.6777
 0.5118  1.0216  0.7997
[torch.DoubleTensor of size 4x3]

In [20]:
# Numpy array matches new values from Tensor
a

array([[ 0.67338991,  1.19063124,  1.30867888],
       [ 1.73062448,  1.19890728,  0.56087946],
       [ 0.96818606,  1.96715243,  0.67768568],
       [ 0.51182782,  1.02163565,  0.79972807]])
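
If the shared memory is not what you want, one option is to copy the data instead. The sketch below contrasts `torch.from_numpy()`, which shares memory, with `torch.tensor()`, which copies. The array of ones is an arbitrary example.

```python
import numpy as np
import torch

a = np.ones((2, 3))

shared = torch.from_numpy(a)  # shares memory with `a`
copied = torch.tensor(a)      # copies the data into new memory

shared.mul_(2)   # in place: also changes `a`
print(a)

copied.mul_(10)  # in place on the copy: `a` is unaffected
print(a)
```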