# Linear Algebra Foundations

Understanding linear algebra is essential for grasping how machine learning models, especially neural networks, operate. Here is a concise review of the key concepts:

---

## Vectors and Matrices

A **vector** is an ordered list of numbers, often written as a column or row. Formally, an $n$-dimensional vector is:

$$
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
$$

**Examples:**

- 2D vector:  
    $$
    \mathbf{a} = \begin{bmatrix} 5 \\ 7 \end{bmatrix}
    $$
- 3D vector:  
    $$
    \mathbf{b} = \begin{bmatrix} 2 \\ -1 \\ 4 \end{bmatrix}
    $$
- 4D vector:  
    $$
    \mathbf{c} = \begin{bmatrix} 0 \\ 3 \\ 8 \\ -2 \end{bmatrix}
    $$

Vectors can represent features, inputs, or outputs in neural networks, with the dimension corresponding to the number of features or neurons. Together with matrices, they are the building blocks of linear algebra and are fundamental to machine learning.

- **Vectors** are ordered lists of numbers, often representing features, data points, or weights, such as the input features to a neural network layer. In code, a vector might look like `np.array([5, 7])` for a 2D vector, or `np.array([2, -1, 4])` for a 3D vector.

- **Matrices** are rectangular arrays of numbers arranged in rows and columns. They are used to organize data and perform transformations. In neural networks, the weights connecting layers are typically stored in matrices. For example, a matrix $A$ with shape $2 \times 3$:
  
  $$
  A = \begin{bmatrix}
  1 & 2 & 3 \\
  4 & 5 & 6
  \end{bmatrix}
  $$
  
  In code, this is represented as `np.array([[1, 2, 3], [4, 5, 6]])`.

**Types of matrices commonly used:**
- **Square matrix:** Same number of rows and columns (e.g., $2 \times 2$).
  
  $$
  S = \begin{bmatrix}
  7 & 8 \\
  9 & 10
  \end{bmatrix}
  $$

- **Identity matrix:** Diagonal elements are 1, others are 0. Acts as a multiplicative identity.
  
  $$
  I = \begin{bmatrix}
  1 & 0 \\
  0 & 1
  \end{bmatrix}
  $$

- **Zero matrix:** All elements are zero.
  
  $$
  Z = \begin{bmatrix}
  0 & 0 \\
  0 & 0
  \end{bmatrix}
  $$

- **Diagonal matrix:** All off-diagonal elements are zero; only the diagonal may hold nonzero values.
  
  $$
  D = \begin{bmatrix}
  3 & 0 \\
  0 & 5
  \end{bmatrix}
  $$
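As a rough sketch, the special matrices above can be built with NumPy constructors. The variable names mirror the symbols in the text, and the checks at the end illustrate why the identity and zero matrices are called identities for multiplication and addition.

```python
import numpy as np

S = np.array([[7, 8], [9, 10]])   # square matrix from the example above
I = np.eye(2)                     # 2x2 identity matrix
Z = np.zeros((2, 2))              # 2x2 zero matrix
D = np.diag([3, 5])               # diagonal matrix with 3 and 5 on the diagonal

print(S.shape)                    # (rows, columns)
print(np.array_equal(S @ I, S))   # multiplying by I leaves S unchanged
print(np.array_equal(S + Z, S))   # adding Z leaves S unchanged
```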

**Why they matter:**
- Vectors represent data, weights, and activations.
- Matrices organize weights and data batches, and enable efficient computation through matrix multiplication.
- Understanding their properties (such as shape, transpose, and special types) is crucial for implementing and debugging neural networks.


---


## Norm

The norm of a vector or matrix is a measure of its "size" or "length." The most common is the Euclidean (L2) norm, which is the square root of the sum of the squares of the elements.

#### Example: Vector Norm

For a vector $\mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$, the L2 norm is:

$$
\|\mathbf{u}\|_2 = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74
$$

#### Example: Matrix Norm

For a matrix $A$, the Frobenius norm is:

$$
\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}
$$

For

$$
A = \begin{bmatrix}
1 & 2 & 3 \\
3 & 4 & 5
\end{bmatrix}
$$

the Frobenius norm is:

$$
\|A\|_F = \sqrt{1^2 + 2^2 + 3^2 + 3^2 + 4^2 + 5^2} = \sqrt{1 + 4 + 9 + 9 + 16 + 25} = \sqrt{64} = 8
$$

**Interpretation:**  
- Norms are used to measure distances, regularize weights (prevent overfitting), and quantify errors in neural networks.  
- L1 norm (sum of absolute values) and L2 norm (Euclidean) are common in machine learning.
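The norms above can be checked with `np.linalg.norm`. This is a minimal sketch using the same $\mathbf{u}$ and $A$ as the worked examples; the variable names `l2`, `l1`, and `fro` are ours.

```python
import numpy as np

u = np.array([1, 2, 3])
A = np.array([[1, 2, 3], [3, 4, 5]])

l2 = np.linalg.norm(u)           # default for 1-D input: Euclidean (L2) norm
l1 = np.linalg.norm(u, ord=1)    # L1 norm: sum of absolute values
fro = np.linalg.norm(A)          # default for 2-D input: Frobenius norm

print(l2, l1, fro)
```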

---

## Element-wise Operations

In linear algebra and neural networks, **element-wise operations** refer to performing a specific operation on each individual element of a vector or matrix. These operations are fundamental for data transformation and are heavily used in tasks like normalization and activation functions.

### Scalars and Broadcasting

A **scalar** is a single numerical value. When a vector or matrix is combined with a scalar through addition, subtraction, multiplication, or division, the operation is applied to each element individually. This behavior is known as **broadcasting**.

For example, multiplying a vector by a scalar:

$$
2 \cdot \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} =
\begin{bmatrix} 2 \cdot 1 \\ 2 \cdot 2 \\ 2 \cdot 3 \end{bmatrix} =
\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}
$$

Broadcasting is commonly used to **scale** or **shift** data, such as during normalization or when applying a bias term.
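A small sketch of scaling, shifting, and normalizing with a scalar in NumPy. The input values and the choice of standardization (zero mean, unit standard deviation) are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

scaled = 2 * x        # every element is multiplied by 2
shifted = x + 10      # 10 is added to every element, like a bias term

# Standardization: subtract the mean, then divide by the standard deviation.
# Both scalars are broadcast across every element of x.
standardized = (x - x.mean()) / x.std()

print(scaled, shifted, standardized)
```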

#### Example: Element-wise Vector Operations

Let the vectors:

$$
\mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad
\mathbf{v} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}
$$

#### 1. **Addition**

Each corresponding pair of elements is added:

$$
\mathbf{u} + \mathbf{v} =
\begin{bmatrix} 1 + 4 \\ 2 + 5 \\ 3 + 6 \end{bmatrix} =
\begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}
$$

#### 2. **Element-wise Multiplication** (Hadamard Product)

Each element is multiplied with the corresponding element of the other vector:

$$
\mathbf{u} \circ \mathbf{v} =
\begin{bmatrix} 1 \cdot 4 \\ 2 \cdot 5 \\ 3 \cdot 6 \end{bmatrix} =
\begin{bmatrix} 4 \\ 10 \\ 18 \end{bmatrix}
$$

#### 3. **Applying a Function (e.g., Squaring)**

Functions such as square, square root, ReLU, sigmoid, or tanh are often applied element-wise:

$$
\mathbf{u}^2 =
\begin{bmatrix} 1^2 \\ 2^2 \\ 3^2 \end{bmatrix} =
\begin{bmatrix} 1 \\ 4 \\ 9 \end{bmatrix}
$$

This principle applies to **activation functions** in neural networks, where functions like ReLU, sigmoid, or tanh are applied to each neuron output individually.


## Vector-on-Vector Element-wise Operations

Element-wise operations between vectors apply a binary operation (such as addition or multiplication) to each corresponding pair of elements from two vectors of the same size. Element-wise multiplication is also called the **Hadamard product**.

Let the vectors be:

$$
\mathbf{u} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad
\mathbf{v} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}
$$

1. **Element-wise Addition**

Each corresponding pair of elements is added:

$$
\mathbf{u} + \mathbf{v} =
\begin{bmatrix} 1 + 4 \\ 2 + 5 \\ 3 + 6 \end{bmatrix}
=
\begin{bmatrix} 5 \\ 7 \\ 9 \end{bmatrix}
$$

2. **Element-wise Multiplication** (the Hadamard product, which is not the dot product)

Each element of $\mathbf{u}$ is multiplied by the corresponding element of $\mathbf{v}$:

$$
\mathbf{u} \circ \mathbf{v} =
\begin{bmatrix} 1 \cdot 4 \\ 2 \cdot 5 \\ 3 \cdot 6 \end{bmatrix}
=
\begin{bmatrix} 4 \\ 10 \\ 18 \end{bmatrix}
$$

3. **Element-wise Function Application (e.g., Squaring $\mathbf{u}$)**

You can also apply a function to each element of a single vector independently:

$$
\mathbf{u}^2 =
\begin{bmatrix} 1^2 \\ 2^2 \\ 3^2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 4 \\ 9 \end{bmatrix}
$$

---

These operations are fundamental in machine learning and neural networks, where transformations are often applied element-wise to vectors or matrices (e.g., during activation, loss computation, or feature scaling).
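In NumPy, the arithmetic operators act element-wise on arrays of the same shape. The sketch below reuses $\mathbf{u}$ and $\mathbf{v}$ from the examples; the ReLU input `z` is an illustrative value of our own.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

print(u + v)     # element-wise addition
print(u * v)     # element-wise (Hadamard) product; `*` is not the dot product
print(u ** 2)    # element-wise square

# ReLU applied element-wise: negative entries become 0.
z = np.array([-1.5, 0.0, 2.0])
relu = np.maximum(z, 0)
print(relu)
```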


## Dot Product

The **dot product** (also called the **inner product**) of two vectors produces a **scalar** — a single number — by multiplying corresponding elements and summing the results. It is written using a **dot** notation: $\mathbf{a} \cdot \mathbf{b}$.

### Dot Product Formula

Given two vectors:

$$
\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
$$

The dot product is:

$$
\mathbf{a} \cdot \mathbf{b} =
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}
\cdot
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
= a_1 b_1 + a_2 b_2 + a_3 b_3
$$

In general:

$$
\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i
$$

### Example: Dot Product Calculation

Let:

$$
\mathbf{a} = \begin{bmatrix} 2 \\ 5 \\ 7 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} 3 \\ 4 \\ 1 \end{bmatrix}
$$

Compute:

$$
\mathbf{a} \cdot \mathbf{b} = (2 \cdot 3) + (5 \cdot 4) + (7 \cdot 1) = 6 + 20 + 7 = 33
$$

**Result:**
$$
\mathbf{a} \cdot \mathbf{b} = 33
$$

This is a **scalar**, not a vector.
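The same calculation can be sketched in NumPy. `np.dot` and the `@` operator both compute the dot product for 1-D arrays; the `manual` line spells out the multiply-and-sum definition for comparison.

```python
import numpy as np

a = np.array([2, 5, 7])
b = np.array([3, 4, 1])

dot = np.dot(a, b)                          # multiply corresponding elements, then sum
same = a @ b                                # @ gives the same result for 1-D arrays
manual = sum(x * y for x, y in zip(a, b))   # the definition written out

print(dot, same, manual)
```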



### Matrix Multiplication (Dot Product Generalized)

The **dot product** of vectors produces a scalar. When generalized to matrices, this becomes **matrix multiplication**, where each entry of the result is a dot product between a **row** of the first matrix and a **column** of the second matrix.

Let:

$$
A = \begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix}, \quad
B = \begin{bmatrix}
9 & 8 & 7 \\
6 & 5 & 4 \\
3 & 2 & 1
\end{bmatrix}
$$

To compute the matrix product $C = AB$, each element $C_{ij}$ is calculated as the **dot product** of the $i$-th row of $A$ and the $j$-th column of $B$:

$$
C_{ij} = \sum_{k=1}^{3} A_{ik} \cdot B_{kj}
$$

---

#### Step-by-Step Example

We compute each element of $C = AB$:

- **First row of A** $\cdot$ **First column of B**:
  $$
  C_{11} = (1 \cdot 9) + (2 \cdot 6) + (3 \cdot 3) = 9 + 12 + 9 = 30
  $$

- **First row of A** $\cdot$ **Second column of B**:
  $$
  C_{12} = (1 \cdot 8) + (2 \cdot 5) + (3 \cdot 2) = 8 + 10 + 6 = 24
  $$

- **First row of A** $\cdot$ **Third column of B**:
  $$
  C_{13} = (1 \cdot 7) + (2 \cdot 4) + (3 \cdot 1) = 7 + 8 + 3 = 18
  $$

- **Second row of A** $\cdot$ **First column of B**:
  $$
  C_{21} = (4 \cdot 9) + (5 \cdot 6) + (6 \cdot 3) = 36 + 30 + 18 = 84
  $$

- **Second row of A** $\cdot$ **Second column of B**:
  $$
  C_{22} = (4 \cdot 8) + (5 \cdot 5) + (6 \cdot 2) = 32 + 25 + 12 = 69
  $$

- **Second row of A** $\cdot$ **Third column of B**:
  $$
  C_{23} = (4 \cdot 7) + (5 \cdot 4) + (6 \cdot 1) = 28 + 20 + 6 = 54
  $$

- **Third row of A** $\cdot$ **First column of B**:
  $$
  C_{31} = (7 \cdot 9) + (8 \cdot 6) + (9 \cdot 3) = 63 + 48 + 27 = 138
  $$

- **Third row of A** $\cdot$ **Second column of B**:
  $$
  C_{32} = (7 \cdot 8) + (8 \cdot 5) + (9 \cdot 2) = 56 + 40 + 18 = 114
  $$

- **Third row of A** $\cdot$ **Third column of B**:
  $$
  C_{33} = (7 \cdot 7) + (8 \cdot 4) + (9 \cdot 1) = 49 + 32 + 9 = 90
  $$

#### Final Result:

Putting it all together:

$$
C = AB = \begin{bmatrix}
30 & 24 & 18 \\
84 & 69 & 54 \\
138 & 114 & 90
\end{bmatrix}
$$


#### Key Point

Matrix multiplication is **not** element-wise. It involves **dot products** between rows and columns, resulting in new combinations of values — not just simple multiplication.
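To make the row-by-column rule concrete, here is a minimal pure-Python sketch of $C_{ij} = \sum_k A_{ik} B_{kj}$. The helper name `matmul` is ours (not a library function), and the final line computes the element-wise product only to show that it gives a different result.

```python
import numpy as np

def matmul(A, B):
    """Multiply two matrices given as nested lists using C_ij = sum_k A_ik * B_kj."""
    n, m, p = len(A), len(B), len(B[0])
    if len(A[0]) != m:
        raise ValueError("columns of A must match rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

C = matmul(A, B)                                     # row-by-column dot products
elementwise = (np.array(A) * np.array(B)).tolist()   # entry-by-entry products

print(C)
print(elementwise)
```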



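```python
import numpy as np

# For 2-D arrays, np.dot performs matrix multiplication (equivalent to a @ b).
# For 1-D arrays such as np.array([2, 5, 7]) and np.array([3, 4, 1]),
# it returns the scalar dot product instead.
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])
dot_product = np.dot(a, b)
print("Dot product:", dot_product)
```

```
Dot product: [[ 30  24  18]
 [ 84  69  54]
 [138 114  90]]
```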


### Summary: Dot Product vs. Matrix Multiplication

- The **dot product** takes two vectors and returns a scalar.
- **Matrix multiplication** generalizes this idea: each element in the result is a dot product between a row of the first matrix and a column of the second.




### Dot Product vs. Element-wise Multiplication

| Operation                   | Symbol  | Output Type                               | Description                                                 |
|-----------------------------|---------|-------------------------------------------|-------------------------------------------------------------|
| Dot product                 | $\cdot$ | Scalar (matrix for matrix multiplication) | Multiply and sum: $a_1b_1 + a_2b_2 + \dots$                 |
| Element-wise multiplication | $\circ$ | Vector (same size as the inputs)          | Multiply each element separately: $[a_1b_1, a_2b_2, \dots]$ |

### Why It Matters

- The **dot product** is used in neural networks to compute the **weighted sum** of inputs before applying an activation function — this is the basic operation of a neuron.
- **Element-wise operations**, on the other hand, are used for scaling, normalization, and activation functions that apply independently to each element.
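The two roles can be seen together in a tiny forward pass. This is only a sketch: the inputs, weights, and biases are made-up illustrative values, and ReLU is chosen as the activation purely as an example.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])            # inputs (illustrative values)
W = np.array([[0.8, 0.2, 0.5],            # one row of weights per neuron
              [-0.3, 0.4, -0.6]])
b = np.array([0.1, 0.0])                  # one bias per neuron

# Single neuron: dot product of its weights with the inputs, plus its bias.
z_first = np.dot(W[0], x) + b[0]

# Whole layer: a matrix-vector product computes every weighted sum at once.
z = W @ x + b

# Activation: ReLU is applied element-wise to the weighted sums.
a = np.maximum(z, 0.0)
print(z_first, z, a)
```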

---



## Transpose

The transpose of a matrix flips its rows and columns. It is often used to align dimensions for multiplication.

### Example: Matrix Transpose

Suppose we have a matrix:

$$
A = \begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6
\end{bmatrix}
$$

The transpose of $A$, denoted $A^\top$, is:

$$
A^\top = \begin{bmatrix}
1 & 4 \\
2 & 5 \\
3 & 6
\end{bmatrix}
$$

**Result:**  
Rows become columns and columns become rows.


```python
import numpy as np

# Try changing the shape of the array.
# reshape re-reads the same six elements in row-major order into a new shape;
# only .T swaps rows and columns.
A = np.array([[1, 2, 3], [3, 4, 5]])
print(f"A\n{A}")
B = A.reshape(6, 1)
print(f"A.reshape(6, 1):\n{B}")
C = A.reshape(1, 6)
print(f"A.reshape(1, 6):\n{C}")
D = A.reshape(2, 3)
print(f"A.reshape(2, 3):\n{D}")
E = A.reshape(3, 2)
print(f"A.reshape(3, 2):\n{E}")
print(f"A:\n{A}")
transpose = A.T
print(f"transpose:\n{transpose}")
```

```
A
[[1 2 3]
 [3 4 5]]
A.reshape(6, 1):
[[1]
 [2]
 [3]
 [3]
 [4]
 [5]]
A.reshape(1, 6):
[[1 2 3 3 4 5]]
A.reshape(2, 3):
[[1 2 3]
 [3 4 5]]
A.reshape(3, 2):
[[1 2]
 [3 3]
 [4 5]]
A:
[[1 2 3]
 [3 4 5]]
transpose:
[[1 3]
 [2 4]
 [3 5]]
```
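The cell above uses both `reshape` and `.T`, which are easy to confuse because both can produce a $3 \times 2$ array. The following sketch, using the $2 \times 3$ matrix from the text, makes the difference explicit and shows a transpose being used to align dimensions for multiplication. The name `G` is ours.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])          # shape (2, 3), as in the text

print(A.T.shape)                              # rows and columns are swapped
print(np.array_equal(A.T, A.reshape(3, 2)))   # reshape is not a transpose

# A @ A is undefined because (2, 3) @ (2, 3) has mismatched inner dimensions,
# but A @ A.T is (2, 3) @ (3, 2), which yields a (2, 2) matrix.
G = A @ A.T
print(G)
```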



## Determinant

The determinant is a scalar value computed from a square matrix. It tells us whether the matrix can be inverted, how the corresponding transformation scales area or volume and whether it flips orientation, and whether a system of linear equations has a unique solution. If the determinant is zero, the matrix cannot be inverted and the system has either no solution or infinitely many, never exactly one.

The general formula for the determinant of an $n \times n$ matrix $A$ is:

$$
\det(A) = \sum_{\sigma \in S_n} \mathrm{sgn}(\sigma) \prod_{i=1}^n a_{i, \sigma(i)}
$$

where $S_n$ is the set of all permutations of $\{1, 2, \dots, n\}$ and $\mathrm{sgn}(\sigma)$ is the sign of the permutation $\sigma$.
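The permutation formula can be translated into code almost literally. The sketch below is for illustration only: it uses 0-based indices, the function names `permutation_sign` and `leibniz_det` are ours, and because it visits all $n!$ permutations it is practical only for very small matrices.

```python
from itertools import permutations

def permutation_sign(perm):
    """Return +1 for an even permutation and -1 for an odd one (by counting inversions)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1

def leibniz_det(matrix):
    """Determinant of a square list-of-lists matrix via the permutation formula."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        product = 1
        for i in range(n):
            product *= matrix[i][perm[i]]   # a_{i, sigma(i)}
        total += permutation_sign(perm) * product
    return total

print(leibniz_det([[1, 2], [3, 4]]))
```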



### The determinant of a $2 \times 2$ matrix is calculated as:

$$
\det\left(\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}\right) = ad - bc
$$

**Example** Let's compute the determinant of a $2 \times 2$ matrix using an example:

$$
A = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
$$

The determinant is:

$$
\det(A) = (1 \times 4) - (2 \times 3) = 4 - 6 = -2
$$


### For a $3 \times 3$ matrix the general formula is:

$$
\det\left(\begin{bmatrix}
a & b & c \\
d & e & f \\
g & h & i
\end{bmatrix}\right) = aei + bfg + cdh - ceg - bdi - afh
$$

### However, it is often easier to use the Cofactor Expansion formula:

For a $3 \times 3$ matrix:

$$
A = \begin{bmatrix}
a & b & c \\
d & e & f \\
g & h & i
\end{bmatrix}
$$

We compute the determinant by expanding along the **first row**:

$$
\det(A) =
a \cdot \det\begin{bmatrix} e & f \\ h & i \end{bmatrix}
- b \cdot \det\begin{bmatrix} d & f \\ g & i \end{bmatrix}
+ c \cdot \det\begin{bmatrix} d & e \\ g & h \end{bmatrix}
$$

Each $2 \times 2$ determinant is computed with the $2 \times 2$ rule given earlier (product of the main diagonal minus product of the other diagonal); for example, $\det\begin{bmatrix} e & f \\ h & i \end{bmatrix} = ei - fh$.

Suppose we have:

$$
C = \begin{bmatrix}
1 & 2 & 3 \\
3 & 4 & 5 \\
6 & 7 & 8
\end{bmatrix}
$$


To compute $\det(C)$, expand along the first row:


$$
\det(C) = 1 \cdot
\begin{vmatrix}
4 & 5 \\
7 & 8
\end{vmatrix}
- 2 \cdot
\begin{vmatrix}
3 & 5 \\
6 & 8
\end{vmatrix}
+ 3 \cdot
\begin{vmatrix}
3 & 4 \\
6 & 7
\end{vmatrix}
$$


Each $2 \times 2$ determinant is:

$$
\begin{vmatrix}
a & b \\
c & d
\end{vmatrix}
= ad - bc
$$

So, the determinant is:

$$
\begin{aligned}
\det(C) &= 1 \cdot (4 \times 8 - 5 \times 7) - 2 \cdot (3 \times 8 - 5 \times 6) + 3 \cdot (3 \times 7 - 4 \times 6) \\
&= 1 \cdot (32 - 35) - 2 \cdot (24 - 30) + 3 \cdot (21 - 24) \\
&= 1 \cdot (-3) - 2 \cdot (-6) + 3 \cdot (-3) \\
&= -3 + 12 - 9 = 0
\end{aligned}
$$

Therefore, $\det(C) = 0$.
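The hand calculation can be double-checked with a few lines of Python. This sketch evaluates the three terms of the first-row expansion for the matrix $C$ above and compares the total with the direct six-term $3 \times 3$ formula; the helper name `det2` and the variable names are ours.

```python
def det2(p, q, r, s):
    """Determinant of the 2x2 matrix [[p, q], [r, s]]."""
    return p * s - q * r

C = [[1, 2, 3], [3, 4, 5], [6, 7, 8]]
(a, b, c), (d, e, f), (g, h, i) = C

# Signed terms of the cofactor expansion along the first row.
terms = [a * det2(e, f, h, i), -b * det2(d, f, g, i), c * det2(d, e, g, h)]
det_C = sum(terms)

# Direct formula: aei + bfg + cdh - ceg - bdi - afh.
direct = a*e*i + b*f*g + c*d*h - c*e*g - b*d*i - a*f*h

print(terms, det_C, direct)
```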



#### Interpretation

- If the determinant is zero, the matrix is singular (not invertible).
- If the determinant is nonzero, the matrix is invertible.

Determinants are important for understanding the properties of transformations in neural networks and for solving systems of equations.


#### Using NumPy

You can compute determinants in code using NumPy:


```python
import numpy as np

# Note: this C is a different singular matrix from the hand-worked example above.
C = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

det_C = np.linalg.det(C)
# Floating-point round-off means the result is tiny but not exactly zero,
# so print many decimal places and compare with a tolerance.
print(f"Determinant of C: (should be 0)\n{det_C:.130f}")  # show 130 decimal places
print(f"is close to zero? {np.isclose(det_C, 0)}")  # True
```

```
Determinant of C: (should be 0)
-0.0000000000000009516197353929940532026773740116004442025297296522956536080073419725522398948669433593750000000000000000000000000000
is close to zero? True
```


```python
def determinant(matrix):
    """Determinant of a square matrix (list of lists) by cofactor expansion along the first row."""
    n = len(matrix)
    # Base cases: 1x1 and 2x2 matrices
    if n == 1:
        return matrix[0][0]
    if n == 2:
        # ad - bc for [[a, b], [c, d]]
        return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]
    # Recursive case (Laplace expansion): for each column of the first row,
    # remove the first row and that column to form the minor, then add the
    # signed product of the entry and the minor's determinant.
    det = 0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        cofactor = ((-1) ** col) * matrix[0][col] * determinant(minor)
        det += cofactor
    return det

# Example usage (C is the NumPy array defined in the previous cell):
det_C_py = determinant(C.tolist())
print(f"pure python determinant of C: {det_C_py}")
```

```
pure python determinant of C: 0
```


## Invertibility of Matrices

A matrix is **invertible** (also called **nonsingular**) if and only if its determinant is **nonzero**. If the determinant is zero, the matrix is **singular** and does **not** have an inverse.

### Why does the determinant matter?

- **If $\det(A) \neq 0$:** $A$ is invertible (there exists $A^{-1}$ such that $A A^{-1} = I$).
- **If $\det(A) = 0$:** $A$ is not invertible (no $A^{-1}$ exists).

### Example 1: Invertible $2 \times 2$ Matrix

Let
$$
M = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
$$

Compute the determinant:
$$
\det(M) = (1 \times 4) - (2 \times 3) = 4 - 6 = -2
$$

Since $\det(M) \neq 0$, $M$ is invertible.

### Example 2: Singular $3 \times 3$ Matrix

Let
$$
C = \begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix}
$$

Compute the determinant (as done in the NumPy and pure-Python cells above):
$$
\det(C) = 0
$$

Since $\det(C) = 0$, $C$ is **not** invertible.

### Summary Table

| Matrix | Determinant | Invertible? |
|--------|-------------|-------------|
| $M$    | $-2$        | Yes         |
| $C$    | $0$         | No          |

**In practice:**  
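- Check that a matrix is invertible before trying to compute its inverse. In floating-point code, compare the determinant to zero with a tolerance (as with `np.isclose` above) rather than testing for exact equality.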
- In neural networks and data science, invertibility is important for solving systems of equations and for certain algorithms (e.g., finding unique solutions, matrix decompositions).
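As a hedged illustration of the first point, the sketch below checks the determinant with a tolerance and then solves $M\mathbf{x} = \mathbf{b}$ for the invertible matrix $M$ from Example 1. The right-hand side `b` is an illustrative value of our own, and using `np.linalg.solve` instead of forming the inverse explicitly is one common choice, not the only one.

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 11.0])          # illustrative right-hand side

det_M = np.linalg.det(M)
if np.isclose(det_M, 0.0):
    print("M is (numerically) singular; no unique solution")
else:
    x = np.linalg.solve(M, b)      # solves M x = b without forming the inverse
    print(x)
```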

## Inverse of a Matrix

The **inverse** of a square matrix $A$, denoted $A^{-1}$, is a matrix such that:

$$
A A^{-1} = A^{-1} A = I
$$

where $I$ is the identity matrix. A matrix is **invertible if and only if** its determinant is nonzero.

### Formula for the Inverse (2×2 Matrix)

For a $2 \times 2$ matrix:

$$
A = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
$$

If $\det(A) = ad - bc \neq 0$, then:

$$
A^{-1} = \frac{1}{ad - bc}
\begin{bmatrix}
d & -b \\
-c & a
\end{bmatrix}
$$

### Example: Inverse of a 2×2 Matrix

Let

$$
M = \begin{bmatrix}
1 & 2 \\
3 & 4
\end{bmatrix}
$$

Compute:

$$
\begin{aligned}
\det(M) &= 1 \cdot 4 - 2 \cdot 3 = -2 \\
M^{-1} &= \frac{1}{-2} \begin{bmatrix} 4 & -2 \\ -3 & 1 \end{bmatrix}
= \begin{bmatrix} -2 & 1 \\ 1.5 & -0.5 \end{bmatrix}
\end{aligned}
$$

Verify:

$$
M^{-1}M = 
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
= I
$$
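The $2 \times 2$ formula is short enough to write out directly. The following is a minimal sketch; the function name `inverse_2x2` is ours, and the comparison with `np.linalg.inv` is included only as a sanity check.

```python
import numpy as np

def inverse_2x2(m):
    """Invert [[a, b], [c, d]] using (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (determinant is zero)")
    return np.array([[d, -b], [-c, a]]) / det

M = np.array([[1, 2], [3, 4]])
M_inv = inverse_2x2(M)

print(M_inv)
print(np.allclose(M_inv @ M, np.eye(2)))          # M^-1 M should be the identity
print(np.allclose(M_inv, np.linalg.inv(M)))       # agrees with NumPy's inverse
```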

### Inverse of a 3×3 Matrix

For larger matrices, the inverse can be found with the adjugate method (shown here for a $3 \times 3$ matrix):

1. **Minors** – Determinants of the $2 \times 2$ submatrices  
2. **Cofactors** – Apply a checkerboard of signs to the minors  
3. **Adjugate** – Transpose the cofactor matrix  
4. **Inverse** – Divide the adjugate by $\det(A)$

Let:

$$
A = \begin{bmatrix}
6 & 1 & 1 \\
4 & -2 & 5 \\
2 & 8 & 7
\end{bmatrix}
$$

#### Step 1: Compute Determinant

$$
\det(A) = 6 \cdot (-54) - 1 \cdot 18 + 1 \cdot 36 = -306
$$

#### Step 2: Compute Minors and Cofactors

Minors matrix (entry $(i, j)$ is the determinant of the $2 \times 2$ submatrix left after deleting row $i$ and column $j$; for example, the $(2, 3)$ minor is $6 \cdot 8 - 1 \cdot 2 = 46$):

$$
\begin{bmatrix}
-54 & 18 & 36 \\
-1 & 40 & 46 \\
7 & 26 & -16
\end{bmatrix}
$$

Apply checkerboard signs (Cofactors matrix):

$$
\begin{bmatrix}
-54 & -18 & 36 \\
1 & 40 & -46 \\
7 & -26 & -16
\end{bmatrix}
$$

#### Step 3: Transpose to Get Adjugate

$$
\text{adj}(A) = 
\begin{bmatrix}
-54 & 1 & 7 \\
-18 & 40 & -26 \\
36 & -46 & -16
\end{bmatrix}
$$

#### Step 4: Compute Inverse

$$
A^{-1} = \frac{1}{-306}
\begin{bmatrix}
-54 & 1 & 7 \\
-18 & 40 & -26 \\
36 & -46 & -16
\end{bmatrix}
$$

#### Verification: $A^{-1} A = I$

To verify:

Let

$$
A^{-1} = \frac{1}{-306}
\begin{bmatrix}
-54 & 1 & 7 \\
-18 & 40 & -26 \\
36 & -46 & -16
\end{bmatrix}, \quad
A = \begin{bmatrix}
6 & 1 & 1 \\
4 & -2 & 5 \\
2 & 8 & 7
\end{bmatrix}
$$

Now compute the matrix product $A^{-1} A$ (only showing final result):

$$
A^{-1} A =
\frac{1}{-306}
\begin{bmatrix}
-306 & 0 & 0 \\
0 & -306 & 0 \\
0 & 0 & -306
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
= I
$$

✅ This confirms that the computed inverse is correct.
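The four steps can also be sketched in code, which is a convenient way to check each minor and cofactor. The function name `adjugate_inverse` is ours, and the $2 \times 2$ minors are computed with `np.linalg.det` for brevity, so the cofactors are floating-point values and are rounded before printing.

```python
import numpy as np

def adjugate_inverse(A):
    """Invert a square matrix via minors, cofactors, and the adjugate."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    cofactors = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Minor: determinant of A with row i and column j removed.
            sub = np.delete(np.delete(A, i, axis=0), j, axis=1)
            # Checkerboard sign turns the minor into a cofactor.
            cofactors[i, j] = (-1) ** (i + j) * np.linalg.det(sub)
    adjugate = cofactors.T
    return adjugate / np.linalg.det(A), cofactors

A = np.array([[6, 1, 1], [4, -2, 5], [2, 8, 7]])
A_inv, cofactors = adjugate_inverse(A)

print(np.round(cofactors))
print(np.allclose(A_inv @ A, np.eye(3)))
```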

```python
import numpy as np

# Compute the inverse of A using NumPy
A = np.array([[6, 1, 1], [4, -2, 5], [2, 8, 7]])

try:
    print("Matrix A:\n", A)
    inv_A = np.linalg.inv(A)
    print("Inverse of A:\n", inv_A)
    # Verify the inverse by multiplying A and its inverse
    identity = np.dot(A, inv_A)
    print("A * A^-1:\n", identity)
    print("Is A * A^-1 close to identity matrix?", np.allclose(identity, np.eye(A.shape[0])))

except np.linalg.LinAlgError as e:
    print("Matrix A is not invertible:", e)
```

```
Matrix A:
 [[ 6  1  1]
 [ 4 -2  5]
 [ 2  8  7]]
Inverse of A:
 [[ 0.17647059 -0.00326797 -0.02287582]
 [ 0.05882353 -0.13071895  0.08496732]
 [-0.11764706  0.1503268   0.05228758]]
A * A^-1:
 [[ 1.00000000e+00  0.00000000e+00 -1.38777878e-17]
 [-8.32667268e-17  1.00000000e+00  8.32667268e-17]
 [-2.77555756e-17  1.11022302e-16  1.00000000e+00]]
Is A * A^-1 close to identity matrix? True
```



---
## Eigenvalues and Eigenvectors

While not always directly used, understanding eigenvalues and eigenvectors helps in analyzing the stability and dynamics of learning algorithms.

Imagine you have a special kind of transformation—like stretching, squishing, or rotating—applied to vectors (arrows) in space. Most vectors will change direction and length when you do this. But some rare, special vectors only get stretched or squished (their direction stays the same). These are called **eigenvectors**.

The amount by which these special vectors are stretched or squished is called the **eigenvalue**.

- **Eigenvector:** A direction that stays the same after the transformation (except for getting longer or shorter).
- **Eigenvalue:** How much the eigenvector is stretched or squished.

**Example:**  
When a door swings on its hinges, points on the hinge axis do not move. Vectors along that axis are eigenvectors of the rotation, and their eigenvalue is $1$ because they are neither stretched nor shrunk.

In math, for a matrix $A$, an eigenvector $\mathbf{v}$ and eigenvalue $\lambda$ satisfy:
$$
A\mathbf{v} = \lambda \mathbf{v}
$$

This means applying $A$ to $\mathbf{v}$ just scales it by $\lambda$—the direction doesn’t change. Eigenvalues and eigenvectors help us understand what a matrix (or transformation) really does at its core.

To find the **eigenvalues** $\lambda$ of a square matrix $A$, solve the **characteristic equation**:

$$
\det(A - \lambda I) = 0
$$

where $I$ is the identity matrix of the same size as $A$.

Once you have an eigenvalue $\lambda$, the corresponding **eigenvectors** $\mathbf{v}$ are the nonzero solutions to:

$$
(A - \lambda I)\mathbf{v} = 0
$$

That is, solve the homogeneous system above for $\mathbf{v}$ for each eigenvalue $\lambda$.


### Example: Eigenvalues and Eigenvectors

Suppose we have a square matrix:

$$
M = \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
$$

To find the eigenvalues $\lambda$, solve the characteristic equation:

$$
\det(M - \lambda I) = 0
$$

First, write the matrix $M$ and the identity matrix $I$:

$$
M = \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}, \quad
I = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
$$

Multiply the identity matrix $I$ by the scalar $\lambda$:

$$
\lambda I = \begin{bmatrix}
\lambda & 0 \\
0 & \lambda
\end{bmatrix}
$$

Subtract $\lambda I$ from $M$:

$$
M - \lambda I = \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
-
\begin{bmatrix}
\lambda & 0 \\
0 & \lambda
\end{bmatrix}
=
\begin{bmatrix}
2 - \lambda & 1 \\
1 & 2 - \lambda
\end{bmatrix}
$$

The determinant is:

$$
(2 - \lambda)(2 - \lambda) - (1 \times 1) = (2 - \lambda)^2 - 1 = 0
$$

Solving for $\lambda$:

$$
\begin{aligned}
(2 - \lambda)^2 &= 1 \\
2 - \lambda &= \pm 1 \\
\lambda_1 &= 2 - 1 = 1 \\
\lambda_2 &= 2 + 1 = 3
\end{aligned}
$$

So, the eigenvalues are $\lambda_1 = 1$ and $\lambda_2 = 3$.
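As a quick numerical check, the characteristic equation can be handed to a polynomial root finder. Expanding $(2 - \lambda)^2 - 1$ gives $\lambda^2 - 4\lambda + 3$; the sketch below passes those coefficients to `np.roots` and sorts the result.

```python
import numpy as np

# Coefficients of lambda^2 - 4*lambda + 3, from highest to lowest degree.
coefficients = [1, -4, 3]
eigenvalues = np.sort(np.roots(coefficients))

print(eigenvalues)
```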

To find the eigenvector for $\lambda = 3$:

$$
\begin{gathered}
(M - 3I)\mathbf{v} = 0 \\
\begin{bmatrix}
-1 & 1 \\
1 & -1
\end{bmatrix}
\begin{bmatrix}
v_1 \\
v_2
\end{bmatrix}
= \begin{bmatrix}
0 \\
0
\end{bmatrix}
\end{gathered}
$$

This gives $-v_1 + v_2 = 0$ and $v_1 - v_2 = 0$, so $v_1 = v_2$.

An eigenvector for $\lambda = 3$ is:

$$
\mathbf{v} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}
$$

Similarly, for $\lambda = 1$:

$$
\begin{gathered}
(M - I)\mathbf{v} = 0 \\
\begin{bmatrix}
1 & 1 \\
1 & 1
\end{bmatrix}
\begin{bmatrix}
v_1 \\
v_2
\end{bmatrix}
= \begin{bmatrix}
0 \\
0
\end{bmatrix}
\end{gathered}
$$

This gives $v_1 + v_2 = 0$, so $v_1 = -v_2$.

An eigenvector for $\lambda = 1$ is:

$$
\mathbf{v} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}
$$
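The hand calculation can be compared with NumPy. This sketch uses `np.linalg.eigh`, which is intended for symmetric matrices such as $M$ and returns eigenvalues in ascending order. Note that NumPy returns unit-length eigenvectors, possibly with the opposite sign, so they are scalar multiples of $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ rather than those exact vectors.

```python
import numpy as np

M = np.array([[2.0, 1.0], [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(M)

# Each column of `eigenvectors` pairs with the eigenvalue at the same index.
for lam, v in zip(eigenvalues, eigenvectors.T):
    # M @ v should equal lam * v for every eigenpair.
    print(lam, v, np.allclose(M @ v, lam * v))
```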

### Interpretation of Eigenvalues and Eigenvectors

- **Eigenvalues ($\lambda_1 = 1$, $\lambda_2 = 3$):**
    - Each eigenvalue is the factor by which $M$ stretches vectors along the corresponding eigenvector. When $M$ is a covariance matrix, as in the sensor example below, it is also the variance of the data along that direction.
    - The larger eigenvalue ($3$) marks the direction along which the transformation has the greatest effect (or the data has the greatest variance).
    - The smaller eigenvalue ($1$) marks a direction with less effect or variance.

- **Eigenvectors ($\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$):**
    - The eigenvector $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ (for $\lambda = 3$) points along the line where both variables increase together. In the sensor example below, this means both temperature and humidity rise or fall together, which is the main trend in the data.
    - The eigenvector $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (for $\lambda = 1$) points along the line where one variable increases as the other decreases. This direction captures the contrast between temperature and humidity.

- **Geometric Meaning:**
    - If you transform a set of points using matrix $M$, points along the $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ direction will be stretched by a factor of $3$, while points along the $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ direction will be stretched by a factor of $1$ (i.e., unchanged in length).
    - In PCA, projecting data onto the eigenvector with the largest eigenvalue gives the principal component—showing the direction of maximum variance.

- **Practical Implication:**
    - In neural networks and data analysis, understanding these directions helps in reducing dimensionality, denoising data, and interpreting the underlying structure of datasets.
    - For the given matrix $M$, most of the "action" is along the $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ direction: if $M$ is a covariance matrix, that direction carries $3 / (1 + 3) = 75\%$ of the total variance, so focusing on it simplifies the analysis at the cost of the remaining $25\%$.

**Summary:**  
- Eigenvalues: $1$ and $3$  
- Eigenvectors: $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$

## Real-Life Example: Using These Eigenvalues and Eigenvectors

Suppose you are analyzing a simple network of two sensors measuring temperature and humidity in a room. The readings are correlated, and you want to understand the main directions of variation in your data—this is a classic use case for Principal Component Analysis (PCA).

Given the covariance matrix:
$$
M = \begin{bmatrix}
2 & 1 \\
1 & 2
\end{bmatrix}
$$

- The eigenvalues ($1$ and $3$) tell you the amount of variance captured along each principal direction.
- The eigenvectors ($\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$) give you the directions in feature space.

**How to use them:**
- The direction $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ (largest eigenvalue $3$) is the axis along which the data varies most—combining temperature and humidity together.
- The direction $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (smaller eigenvalue $1$) is the axis along which the data varies least—contrasting temperature and humidity.

You can project your sensor data onto these axes to reduce dimensionality or to visualize the main trends, helping you identify patterns or anomalies in the room’s climate.
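The projection step might look like the sketch below. Everything about the data is an assumption made for illustration: the readings are synthetic samples drawn with covariance $M$, the sample size and random seed are arbitrary, and the variable names are ours. Because the samples are random, the sample covariance only approximates $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([[2.0, 1.0], [1.0, 2.0]])

# Hypothetical readings: 500 (temperature, humidity) samples with covariance M.
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=M, size=500)

sample_cov = np.cov(data, rowvar=False)               # 2x2 sample covariance
variances, directions = np.linalg.eigh(sample_cov)    # ascending eigenvalues

centered = data - data.mean(axis=0)
projected = centered @ directions                     # coordinates along each eigenvector
first_component = projected[:, -1]                    # direction of largest variance

print(variances)
print(projected.var(axis=0, ddof=1))                  # matches the eigenvalues above
```

Keeping only `first_component` reduces each two-dimensional reading to a single number along the main trend.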

## Summary Table

| Concept         | Neural Network Role                      |
|-----------------|------------------------------------------|
| Vector          | Input, output, weights                   |
| Matrix          | Layer weights, data batches              |
| Dot Product     | Weighted sum in neurons                  |
| Matrix Multiply | Layer transformations                    |
| Transpose       | Aligning dimensions for operations       |
| Determinant     | Invertibility, transformations           |
| Eigenvalues/Eigenvectors | Stability, dynamics analysis         |
| Element-wise    | Activation functions                     |

A solid grasp of these linear algebra concepts will make it easier to understand and implement neural networks.