NumPy Study Notes

# NumPy Study Notes with an emphasis on Large Language Models and Computer Vision applications

Imports

Section 1. NumPy section one test: https://gemini.google.com/share/99dbfcff7f9f

In [2]:
import numpy as np

In [42]:
# 1. np.array(data): Creates an array from a list or tuple.
# Converting a Python list into a high-performance vector.
vector = np.array([1, 2, 3])
print(f"1. Array:\n{vector}\n")

1. Array:
[1 2 3]



In [56]:
MyArray = np.array([[1,2,3], [2,3,4]])
print(MyArray[0][2])

3


In [43]:
# 2. np.zeros(shape): Initializes an array of zeros.
# Useful for padding or mask initialization.
zeros_mask = np.zeros((2, 3))
print(f"2. Zeros (2x3):\n{zeros_mask}\n")

2. Zeros (2x3):
[[0. 0. 0.]
 [0. 0. 0.]]



In [44]:
# 3. np.ones(shape): Initializes an array of ones.
ones_tensor = np.ones((1, 5))
print(f"3. Ones (1x5):\n{ones_tensor}\n")

3. Ones (1x5):
[[1. 1. 1. 1. 1.]]



In [45]:
# 4. np.full(shape, fill_value): Creates a constant array.
# Often used to fill masks with -inf before Softmax.
neg_inf_mask = np.full((2, 2), -np.inf)
print(f"4. Full (-inf):\n{neg_inf_mask}\n")


4. Full (-inf):
[[-inf -inf]
 [-inf -inf]]



In [46]:
# 5. np.eye(N): Creates an identity matrix.
# Often used for initializing orthogonal weights.
identity = np.eye(3)
print(f"5. Identity (3x3):\n{identity}\n")

5. Identity (3x3):
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]



In [47]:
# 6. np.arange(start, stop, step): Returns evenly spaced values.
# Great for positional encoding indices.
indices = np.arange(0, 10, 2)
print(f"6. Arange (0 to 10 step 2):\n{indices}\n")

6. Arange (0 to 10 step 2):
[0 2 4 6 8]



In [48]:
# 7. np.linspace(start, stop, num): num samples over an interval.
linear_space = np.linspace(0, 1, 5)
print(f"7. Linspace (5 points between 0-1):\n{linear_space}\n")

7. Linspace (5 points between 0-1):
[0.   0.25 0.5  0.75 1.  ]



In [49]:
# 8. np.empty(shape): Allocates memory without initializing.
# Fastest for large pre-allocations (contains 'garbage' values).
empty_alloc = np.empty((2, 2))
print(f"8. Empty (Uninitialized):\n{empty_alloc}")

8. Empty (Uninitialized):
[[0.25 0.5 ]
 [0.75 1.  ]]


Here is why np.empty() is dangerous and how it can subtly sabotage your models.

1. It contains "memory garbage"
When you call np.zeros() or np.ones(), NumPy hands you a block of memory in which every element is guaranteed to be $0$ or $1$. When you call np.empty(), NumPy reserves the memory but skips initialization, so you receive the block exactly as it was left. In practice the leftovers usually come from arrays your own Python process freed earlier, since operating systems typically zero memory before handing it to a different process. The output above illustrates this: the "empty" array holds 0.25, 0.5, 0.75 and 1.0, which look like leftovers from the linspace example.

2. The "silent failure" bug
The biggest danger is that your code might run without throwing an error, but produce nonsensical results.
Scenario: You create an array for a mask using np.empty().
The Bug: You forget to fill the last row of the mask.
The Result: Instead of that row being $0$, it contains a "garbage" value like $6.23 \times 10^{-307}$. In a neural network, this value might be treated as a real signal, and larger garbage values can lead to exploding gradients or "NaN" losses that are very hard to trace back to a single uninitialized row.

3. Non-deterministic behavior
Because np.empty() returns whatever is currently in that memory, the "garbage" values can change from run to run.
Run 1: Your code works because the garbage values happened to be very small.
Run 2: Your code crashes because the garbage values happened to be massive.
This makes debugging a nightmare because you cannot consistently reproduce the failure.

When should you actually use it?
Despite the risks, np.empty() exists for a reason: performance. If you are allocating a massive 10GB tensor that you are immediately going to overwrite (e.g., by loading data from a disk or a camera feed), np.empty() saves you the time it takes to write $0$s to 10GB of memory, which can take several seconds.
Rule of Thumb: Use np.empty() only if the very next line of code completely fills that array (like np.copyto(), a random generator call that writes into it through an out= argument, or a full-slice assignment). Note that np.random.randn() does not qualify: it returns a new array rather than filling an existing one.
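The rule of thumb for np.empty() can be sketched as follows. This is a minimal illustration, not a benchmark; the array names, shapes, and seed are made up for the example, and it assumes a NumPy version that provides np.random.default_rng.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Pattern 1: allocate, then overwrite every element on the very next line.
buffer = np.empty((4, 3), dtype=np.float64)
buffer[...] = 0.5  # full-slice assignment touches every element

# Pattern 2: let the random generator write straight into the buffer.
noise = np.empty((4, 3), dtype=np.float64)
rng.standard_normal(out=noise)  # fills `noise` in place

# Pattern 3: copy from an existing array of the same shape.
source = np.arange(12, dtype=np.float64).reshape(4, 3)
target = np.empty_like(source)
np.copyto(target, source)

print(buffer.min(), buffer.max())
print(np.array_equal(target, source))
```

In all three patterns no element of the uninitialized block is ever read before it is written, which is the property that makes np.empty() safe.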

In [57]:
np.zeros(8, 32, 32)  # Bug: the shape must be one tuple, np.zeros((8, 32, 32)); here the second argument is read as dtype

TypeError: Cannot interpret '32' as a data type

In [58]:
np.eye(5, 3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [59]:
np.eye(5, k = 3)

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

# Section 2 - Shape Manipulation and Transposition

Get the Gemini questions here: https://gemini.google.com/share/1cc643ca3f88

In [61]:
# Sample data: A simulated "batch" of 2 sequences, each with 3 tokens, and 4 embedding dimensions
# Shape: (2, 3, 4) -> (Batch, Seq_Len, Embedding_Dim)
data = np.arange(24).reshape(2, 3, 4)
print(data)
print(data.ndim)

[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
3


In [62]:
reshaped = np.reshape(data, (2, 12))
print(reshaped)

[[ 0  1  2  3  4  5  6  7  8  9 10 11]
 [12 13 14 15 16 17 18 19 20 21 22 23]]


In [64]:
transposed = np.transpose(data, axes = (0, 2, 1))
print(transposed)

[[[ 0  4  8]
  [ 1  5  9]
  [ 2  6 10]
  [ 3  7 11]]

 [[12 16 20]
  [13 17 21]
  [14 18 22]
  [15 19 23]]]


In [65]:
matrix = np.array([[1, 2], [3, 4]])
print(matrix.T)

[[1 3]
 [2 4]]


In [67]:
vector = np.array([1, 2, 3, 4])
expanded1 = np.expand_dims(vector, axis=0)
print(expanded1)
expanded2 = np.expand_dims(vector, axis=1)
print(expanded2)



[[1 2 3 4]]
[[1]
 [2]
 [3]
 [4]]


In [96]:
expanded1.reshape(2, -1)  # returns a new (2, 2) array; the result is discarded, so expanded1 itself is unchanged
print(expanded1)
print(expanded1.shape)

[[1 2 3 4]]
(1, 4)


In [70]:
squeezed1 = np.squeeze(expanded1, axis=0)
squeezed2 = np.squeeze(expanded2, axis=1)
print(squeezed1)
print(squeezed2)

[1 2 3 4]
[1 2 3 4]


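a[:, None]: indexing with None (an alias for np.newaxis) inserts a new axis of length 1, here turning shape (4,) into (4, 1).

np.ascontiguousarray(): often called after a transpose, because transposing returns a non-contiguous view and some downstream code (for example CUDA kernels or C extensions) requires contiguous memory.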
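A small sketch of what "contiguous" means in practice. The array shapes here are arbitrary; the point is only to show that a transpose is a view with different strides and that np.ascontiguousarray() makes a row-major copy.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)
xt = np.transpose(x, (0, 2, 1))  # a view: same memory, different strides

print(x.flags['C_CONTIGUOUS'])   # original array is row-major contiguous
print(xt.flags['C_CONTIGUOUS'])  # the transposed view is not
print(np.shares_memory(x, xt))   # the view still points at x's data

xc = np.ascontiguousarray(xt)    # copies the data into a fresh row-major block
print(xc.flags['C_CONTIGUOUS'])
print(np.array_equal(xc, xt))    # same values, new memory layout
```

If the input is already C-contiguous, np.ascontiguousarray() returns it without copying, so calling it defensively is cheap.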

In [88]:
print(squeezed1)
print(squeezed1[:, None])

[1 2 3 4]
[[1]
 [2]
 [3]
 [4]]


In [74]:
a, b = np.ones((2, 2)), np.zeros((2, 2))
concat = np.concatenate((a, b), axis=1)
print(concat)
concat2 = np.concatenate((a, b), axis=0)
print(concat2)

[[1. 1. 0. 0.]
 [1. 1. 0. 0.]]
[[1. 1.]
 [1. 1.]
 [0. 0.]
 [0. 0.]]


New Axis

In [85]:
print(concat2.shape)
concat2[np.newaxis, np.newaxis, :, :].shape

(4, 2)


(1, 1, 4, 2)

In [75]:
stacked = np.stack([a, b], axis=0)
print(stacked)

[[[1. 1.]
  [1. 1.]]

 [[0. 0.]
  [0. 0.]]]


np.stack vs np.concatenate
This is a very common interview question:

np.concatenate: Glues arrays together along an existing axis. (Total dimensions stay the same).

np.stack: Glues arrays together along a new axis. (Total dimensions increase by 1).

In Computer Vision, individual images are usually stored as 3D arrays: $(Height, Width, Channels)$. To train a model, you must "stack" these individual images into a 4D Batch: $(Batch, Height, Width, Channels)$.

Here is the code to simulate loading 3 individual images and preparing them for a model:

In [77]:
import numpy as np

# Simulate 3 individual RGB images (32x32 pixels)
img1 = np.random.randint(0, 256, (32, 32, 3))  # high is exclusive, so 256 includes pixel value 255
img2 = np.random.randint(0, 256, (32, 32, 3))
img3 = np.random.randint(0, 256, (32, 32, 3))

print(img1)

print(f"Single image shape: {img1.shape}")

# Stack them into a batch at axis 0
batch = np.stack([img1, img2, img3], axis=0)

print(f"Batch shape: {batch.shape}")
# Output: (3, 32, 32, 3) -> (Batch Size, Height, Width, Channels)

[[[149 101   6]
  [ 14 223 138]
  [ 23 254 194]
  ...
  [ 40  93  56]
  [249 163  63]
  [240  97  15]]

 [[ 87  77  89]
  [ 71  13  30]
  [ 65 111  67]
  ...
  [  4 230 200]
  [116 145 148]
  [ 71 235 243]]

 [[225  73  50]
  [168 248  23]
  [231 151 247]
  ...
  [170  35 166]
  [188 143 121]
  [229  58  95]]

 ...

 [[ 18 147 214]
  [234  26  93]
  [ 84  43 162]
  ...
  [154 187 210]
  [ 48  64 165]
  [ 82 179   0]]

 [[167  80  20]
  [ 85  14  49]
  [230  47 230]
  ...
  [ 29 107  72]
  [108 240 144]
  [ 99  13  18]]

 [[ 70  58 194]
  [ 96  43 159]
  [110  10 105]
  ...
  [196  64  45]
  [ 17 140 152]
  [231   7  73]]]
Single image shape: (32, 32, 3)
Batch shape: (3, 32, 32, 3)


Why axis=0 is critical here:
By choosing axis=0, you ensure that the first number in the shape tuple is the index of the image. When you want to access the second image, you just call batch[1].

If you had used axis=-1 (the last axis), your shape would be (32, 32, 3, 3): the image index would sit after the channel axis, so every pixel location would carry 9 values (3 channels x 3 images), a complete spatial mess!
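The effect of the axis choice is easiest to see by comparing shapes. The sketch below uses all-zero placeholder images (the pixel values do not matter here) and also shows what np.concatenate would do with the same inputs.

```python
import numpy as np

# Three placeholder "images" with shape (Height, Width, Channels)
imgs = [np.zeros((32, 32, 3), dtype=np.uint8) for _ in range(3)]

batch_first = np.stack(imgs, axis=0)    # new axis in front: one index per image
batch_last = np.stack(imgs, axis=-1)    # new axis at the end: images interleaved per pixel
glued = np.concatenate(imgs, axis=0)    # no new axis: images glued along Height

print(batch_first.shape)  # (3, 32, 32, 3)
print(batch_last.shape)   # (32, 32, 3, 3)
print(glued.shape)        # (96, 32, 3)
```

Only the first result can be indexed as batch_first[i] to recover image i.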

In [78]:
single_val = np.array([[1]])
repeated = np.repeat(single_val, 3, axis=1)
print(repeated)

[[1 1 1]]


In [50]:
import numpy as np

# Sample data: A simulated "batch" of 2 sequences, each with 3 tokens, and 4 embedding dimensions
# Shape: (2, 3, 4) -> (Batch, Seq_Len, Embedding_Dim)
data = np.arange(24).reshape(2, 3, 4)

print("--- 2. Shape Manipulation and Transposition ---")

# 1. np.reshape(a, newshape): Changes shape without changing data.
# Flattening the tokens for a linear layer: (2, 3, 4) -> (2, 12)
reshaped = np.reshape(data, (2, 12))
print(f"1. Reshaped (2x12):\n{reshaped.shape}\n")

# 2. np.transpose(a, axes): Permutes dimensions.
# Swapping Seq_Len and Embedding_Dim for Attention (Batch, Dim, Seq): (2, 4, 3)
transposed = np.transpose(data, (0, 2, 1))
print(f"2. Transposed (0, 2, 1):\n{transposed.shape}\n")

# 3. a.T: Shortcut for 2D transposition.
matrix = np.array([[1, 2], [3, 4]])
print(f"3. Matrix Transpose:\n{matrix.T}\n")

# 4. np.expand_dims(a, axis): Adds a dimension.
# Turning a 1D vector into a "batch of 1": (4,) -> (1, 4)
vector = np.array([1, 2, 3, 4])
expanded = np.expand_dims(vector, axis=0)
print(f"4. Expanded Dims:\n{expanded.shape}\n")

# 5. np.squeeze(a, axis): Removes single-dimensional entries.
# Removing the batch dimension: (1, 4) -> (4,)
squeezed = np.squeeze(expanded, axis=0)
print(f"5. Squeezed:\n{squeezed.shape}\n")

# 6. np.concatenate((a, b), axis): Joins arrays along an existing axis.
# Merging two hidden states side-by-side.
a, b = np.ones((2, 2)), np.zeros((2, 2))
concat = np.concatenate((a, b), axis=1)
print(f"6. Concatenated (axis 1):\n{concat}\n")

# 7. np.stack(arrays, axis): Joins arrays along a NEW axis.
# Creating a batch from individual samples.
stacked = np.stack([a, b], axis=0)
print(f"7. Stacked (axis 0 - results in 2x2x2):\n{stacked.shape}\n")

# 8. np.repeat(a, repeats, axis): Repeats elements.
# Useful for broadcasting a single token mask across the whole sequence.
single_val = np.array([[1]])
repeated = np.repeat(single_val, 3, axis=1)
print(f"8. Repeated (1x3):\n{repeated}")

--- 2. Shape Manipulation and Transposition ---
1. Reshaped (2x12):
(2, 12)

2. Transposed (0, 2, 1):
(2, 4, 3)

3. Matrix Transpose:
[[1 3]
 [2 4]]

4. Expanded Dims:
(1, 4)

5. Squeezed:
(4,)

6. Concatenated (axis 1):
[[1. 1. 0. 0.]
 [1. 1. 0. 0.]]

7. Stacked (axis 0 - results in 2x2x2):
(2, 2, 2)

8. Repeated (1x3):
[[1 1 1]]


In [83]:
batch_1 = np.ones((3, 6 ))
print(batch_1)

batch_1_expand = np.expand_dims(batch_1, axis=1)
print(batch_1_expand)
print(batch_1.shape)
print(batch_1_expand.shape)

[[1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1.]]
[[[1. 1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1. 1.]]

 [[1. 1. 1. 1. 1. 1.]]]
(3, 6)
(3, 1, 6)


# Section 3 - Mathematical Operations & Linear Algebra

In [97]:
import numpy as np

# Sample data: Query and Key matrices for a tiny Attention mechanism
# Shape: (Seq_Len, Dim) -> (3, 4)
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)

print("--- 3. Mathematical Operations & Linear Algebra ---")

# 1. np.matmul(a, b): Matrix product.
# Critical for Attention scores (Q * K.T)
scores = np.matmul(Q, K.T)
print(f"1. Attention Scores Shape (Q * K^T):\n{scores.shape}\n")
print(f"1. Attention Scores (Q * K^T):\n{scores}\n")

# 2. np.dot(a, b): Dot product.
# For 1D it's the inner product; for 2D it's similar to matmul.
dot_prod = np.dot(Q[0], K[0])
print(f"2. Dot product (first token Q and K):\n{dot_prod:.4f}\n")

# 3. np.sum(a, axis, keepdims=True): Summation.
# keepdims is vital for broadcasting during normalization.
row_sum = np.sum(scores, axis=-1, keepdims=True)
print(f"3. Sum (keepdims=True) shape:\n{row_sum.shape}\n")
print(f"3. Sum (keepdims=True):\n{row_sum}\n")

# 4. np.mean / np.std: For Layer Normalization.
mean = np.mean(Q, axis=-1, keepdims=True)
std = np.std(Q, axis=-1, keepdims=True)
print(f"4. Mean shape for LayerNorm:\n{mean.shape}\n")
print(f"4. Mean for LayerNorm:\n{mean}\n")

# 5. np.exp(x): Exponential for Softmax.
# softmax = exp(x) / sum(exp(x))
exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True)) # subtracted max for stability
print(f"5. Exp (Softmax numerator):\n{exp_scores[0, :2]}...\n")

# 6. np.log(x): Natural log for Cross-Entropy loss.
probs = np.array([0.1, 0.7, 0.2])
loss = -np.log(probs[1]) # Loss for the correct class (index 1)
print(f"6. Log Loss calculation:\n{loss:.4f}\n")

# 7. np.sqrt(x): Square root for scaling attention.
d_k = Q.shape[-1]
scaled_scores = scores / np.sqrt(d_k)
print(f"7. Scaled scores (first row):\n{scaled_scores[0]}\n")

# 8. np.linalg.norm(x, axis): Vector magnitude.
# Used to calculate Cosine Similarity between embeddings.
norm_Q = np.linalg.norm(Q, axis=-1)
print(f"8. Norm of Query vectors:\n{norm_Q}\n")

# 9. np.einsum: The "Swiss Army Knife" of tensor operations.
# The same notation extends to batched matrix multiplication on (Batch, Head, Seq, Dim) tensors.
# Below it computes Q @ K.T: 'ik,jk->ij' multiplies matching k entries and sums over 'k'.
einsum_res = np.einsum('ik,jk->ij', Q, K)
print(f"9. Einsum result (matches matmul):\n{np.allclose(scores, einsum_res)}")

--- 3. Mathematical Operations & Linear Algebra ---
1. Attention Scores Shape (Q * K^T):
(3, 3)

1. Attention Scores (Q * K^T):
[[-1.89926456  0.37706613  2.64701485]
 [ 0.74450513  0.98959147  0.36480982]
 [-3.49356701  0.12188847  3.875686  ]]

2. Dot product (first token Q and K):
-1.8993

3. Sum (keepdims=True) shape:
(3, 1)

3. Sum (keepdims=True):
[[1.12481642]
 [2.09890642]
 [0.50400746]]

4. Mean shape for LayerNorm:
(3, 1)

4. Mean for LayerNorm:
[[-0.19153584]
 [-0.14276591]
 [-0.5684074 ]]

5. Exp (Softmax numerator):
[0.01060659 0.10331748]...

6. Log Loss calculation:
0.3567

7. Scaled scores (first row):
[-0.94963228  0.18853307  1.32350742]

8. Norm of Query vectors:
[1.9236944  1.28270811 2.32549765]

9. Einsum result (matches matmul):
True


Why np.einsum is the "Final Boss"
In advanced LLM code, you'll often see np.einsum('bhqd,bhkd->bhqk', Q, K). This looks intimidating, but it's just shorthand for: "for every Batch (b) and Head (h), take the dot product over the embedding dimension (d) between each query position (q) and each key position (k)." The result has shape (Batch, Head, Seq_q, Seq_k), and the single call saves you from chaining explicit reshape and transpose steps.
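To make the 4D einsum less abstract, the sketch below compares it against the equivalent matmul with the last two axes of K swapped. The sizes B, H, S, D are hypothetical and chosen small for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, S, D = 2, 4, 5, 8  # hypothetical batch, heads, sequence length, head dim

Q = rng.standard_normal((B, H, S, D))
K = rng.standard_normal((B, H, S, D))

# 'd' appears in both inputs but not in the output, so it is summed over.
scores_einsum = np.einsum('bhqd,bhkd->bhqk', Q, K)

# Same computation: batched matmul of Q with K transposed on its last two axes.
scores_matmul = Q @ np.swapaxes(K, -1, -2)

print(scores_einsum.shape)
print(np.allclose(scores_einsum, scores_matmul))
```

Both forms give one (Seq, Seq) score matrix per batch element and head; which one is faster depends on the shapes and the NumPy build, so it is worth timing on real data.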

In NumPy, axis=-1 is a shorthand that means "the very last dimension." It is incredibly common in AI because it allows you to write code that works regardless of how many dimensions your data has (e.g., whether you are processing a single vector, a matrix, or a 4D batch).

1. The "Counting Backwards" Rule
Just like Python list indexing (where list[-1] is the last item), NumPy axis indexing wraps around:

axis=0: The first dimension (usually Rows or Batch).

axis=1: The second dimension (usually Columns or Sequence).

axis=-1: The last dimension, whatever that may be.

2. Why use -1 for Summation?
In the code above, scores represents an Attention matrix (the same logic applies to a set of Logits).

In a matrix of shape (Rows, Columns), axis=-1 targets the Columns.

When you sum across the columns, you are collapsing the horizontal values into a single total for each row.

Result: You get a "row sum."

3. The "AI Researcher" Reason
In Deep Learning, the last dimension is almost always the "Feature Dimension" (like the 512 numbers that represent a word embedding).

By using axis=-1, your code becomes rank-agnostic.

If you feed it a 2D matrix (Sequence, Features), it sums the features.

If you feed it a 3D batch (Batch, Sequence, Features), it still sums the features.

If you had hard-coded axis=1, the 3D batch would have summed the sequences instead of the features, likely breaking your math.

4. What about keepdims=True?
Notice that the code also uses keepdims=True.

Normally, np.sum on a (10, 5) matrix returns a (10,) vector (it "drops" the summed dimension).

With keepdims=True, it returns (10, 1).

This is "vital for broadcasting" because dividing a (10, 5) matrix by a (10,) vector raises a broadcasting error: NumPy aligns shapes from the last dimension, and 5 does not match 10. Dividing a (10, 5) matrix by a (10, 1) matrix works, because NumPy automatically stretches that single column to fit all 5 columns.
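The two ideas above (axis=-1 for rank-agnostic code, keepdims=True for broadcasting) can be sketched with a simplified layer normalization. The function name layer_norm and the eps default are assumptions for this example, and real implementations usually add learned scale and shift parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the last axis; works for any number of leading axes."""
    mean = np.mean(x, axis=-1, keepdims=True)  # shape (..., 1)
    std = np.std(x, axis=-1, keepdims=True)    # shape (..., 1)
    return (x - mean) / (std + eps)

x2d = np.arange(50, dtype=float).reshape(10, 5)      # (Sequence, Features)
x3d = np.arange(100, dtype=float).reshape(2, 10, 5)  # (Batch, Sequence, Features)
print(layer_norm(x2d).shape, layer_norm(x3d).shape)

# Without keepdims the summed axis is dropped and the division cannot broadcast.
row_sum = x2d.sum(axis=-1)  # shape (10,)
broadcast_failed = False
try:
    x2d / row_sum
except ValueError as err:
    broadcast_failed = True
    print("broadcast failed:", err)
```

The same function handles the 2D and 3D inputs without any change, which is what "rank-agnostic" means here.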

# Softmax function



In [99]:
# 1. Start with scores: (Batch: 2, Seq: 3)
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])

# 2. Sum along the last axis (the row)
# Without keepdims, shape becomes (2,)
# With keepdims=True, shape stays (2, 1)
row_sums = np.sum(np.exp(scores), axis=-1, keepdims=True)
print(row_sums)

# 3. Divide: (2, 3) / (2, 1)
# Broadcasting "stretches" the (2, 1) to (2, 3) automatically
softmax = np.exp(scores) / row_sums
print(softmax)
print(softmax.shape)

[[30.19287485]
 [ 8.15484549]]
[[0.09003057 0.24472847 0.66524096]
 [0.33333333 0.33333333 0.33333333]]
(2, 3)


The explanation of softmax:
axis=-1: Ensures you are summing the tokens in each sequence, even if you add a "Batch" or "Heads" dimension later.

keepdims=True: Keeps the "1" in the shape so the division "lines up" perfectly across the rows.

Here is the mathematical breakdown of that Softmax operation using the logic of axis=-1 and keepdims=True.

1. The Raw Scores ($x$)
Imagine a 2D matrix where each row is a sequence of token scores:
$$x = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix}, \text{with shape } (2, 3)$$

2. Exponentiation
First, we apply the exponential to every element:
$$e^x = \begin{bmatrix} e^1 & e^2 & e^3 \\ e^1 & e^1 & e^1 \end{bmatrix} \approx \begin{bmatrix} 2.72 & 7.39 & 20.08 \\ 2.72 & 2.72 & 2.72 \end{bmatrix}$$

3. Summation with axis=-1 and keepdims=True
We sum along the last axis (across the columns of each row). Because of keepdims=True, the result is a column vector instead of a flat list:
$$\sum e^x = \begin{bmatrix} 2.72 + 7.39 + 20.08 \\ 2.72 + 2.72 + 2.72 \end{bmatrix} = \begin{bmatrix} 30.19 \\ 8.16 \end{bmatrix}, \text{with shape } (2, 1)$$

4. The Division (Broadcasting)
To get the final probabilities ($P$), we divide the exponentiated $(2, 3)$ matrix $e^x$ by the $(2, 1)$ sum vector. NumPy "broadcasts" the column vector across all 3 columns (conceptually repeating it, without physically copying the data) to make the shapes match:
$$P = \frac{\begin{bmatrix} 2.72 & 7.39 & 20.08 \\ 2.72 & 2.72 & 2.72 \end{bmatrix}}{\begin{bmatrix} 30.19 & 30.19 & 30.19 \\ 8.16 & 8.16 & 8.16 \end{bmatrix}}$$

Final Result
$$P = \begin{bmatrix} 0.09 & 0.24 & 0.67 \\ 0.33 & 0.33 & 0.33 \end{bmatrix}, \text{with shape } (2, 3)$$

Summary of the Logic:
The Numerator: $e^{x_i}$
The Denominator: $\sum_{j=1}^n e^{x_j}$
The Shape Logic: $(2, 3) / (2, 1) \rightarrow (2, 3)$
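The breakdown above can be wrapped into a reusable function. This sketch also subtracts the row maximum before exponentiating, the same stability trick used in the Section 3 cell; the function name softmax and its axis parameter are choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the max so np.exp cannot overflow."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 1.0, 1.0]])
probs = softmax(scores)
print(probs)
print(probs.sum(axis=-1))  # each row sums to 1

# Large logits would overflow a naive exp(); the shifted version stays finite.
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(big)
```

Subtracting a constant from every score in a row does not change the result, because the factor cancels between numerator and denominator.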

# Section 4 - Searching, Sorting, and Logic
Here, we focus on Searching, Sorting, and Logic. These functions are the "decision-makers." You will use them to apply masks (like hiding future tokens in a decoder), find the most likely next word (Argmax), or implement complex sampling strategies like Top-K.

In [52]:
import numpy as np

# Sample data: Logits for a vocabulary of 5 words
logits = np.array([1.2, -0.5, 3.8, 2.1, 0.5])
# Sample attention scores (3x3) for masking
scores = np.random.randn(3, 3)

print("--- 4. Searching, Sorting, and Logic ---")

# 1. np.argmax(a, axis): Find the index of the highest value.
# Used to pick the most likely token from the vocabulary.
predicted_token_id = np.argmax(logits)
print(f"1. Predicted Token ID: {predicted_token_id}\n")

# 2. np.argsort(a, axis): Get indices that would sort the array.
# Essential for Top-K sampling (finding the 'K' most likely tokens).
top_k_indices = np.argsort(logits)[-3:] # Indices of the top 3, in ascending order of logit (best one last)
print(f"2. Top-3 Indices (unsorted): {top_k_indices}\n")

# 3. np.where(condition, x, y): The "If-Else" of NumPy.
# Used for Causal Masking: if mask is 0, set score to -inf, else keep score.
mask = np.tril(np.ones((3, 3))) # Lower triangular matrix
masked_scores = np.where(mask == 1, scores, -np.inf)
print(f"3. Masked Attention Scores (Causal):\n{masked_scores}\n")

# 4. np.maximum(x, y): Element-wise maximum.
# This is exactly how the ReLU (Rectified Linear Unit) activation works.
raw_values = np.array([-1, 5, -0.2, 3])
relu_output = np.maximum(0, raw_values)
print(f"4. ReLU Output: {relu_output}\n")

# 5. np.isin(element, test_elements): Membership test.
# Useful for identifying if tokens are "stop words" or special symbols.
vocab_ids = np.array([101, 2045, 300, 102])
special_tokens = [101, 102]
is_special = np.isin(vocab_ids, special_tokens)
print(f"5. Mask for special tokens: {is_special}\n")

# 6. np.unique(ar): Find distinct elements.
# Used to count how many unique tokens are in a generated sequence.
generated_sequence = [5, 5, 2, 8, 2, 1]
unique_tokens = np.unique(generated_sequence)
print(f"6. Unique tokens in sequence: {unique_tokens}")

--- 4. Searching, Sorting, and Logic ---
1. Predicted Token ID: 2

2. Top-3 Indices (unsorted): [0 3 2]

3. Masked Attention Scores (Causal):
[[ 0.29362946        -inf        -inf]
 [-0.16066461 -0.34696028        -inf]
 [ 0.44715133 -0.11761141 -1.26263281]]

4. ReLU Output: [0. 5. 0. 3.]

5. Mask for special tokens: [ True False False  True]

6. Unique tokens in sequence: [1 2 5 8]
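The argsort step above is only half of Top-K sampling; the other half is renormalizing over the surviving tokens and drawing from them. Here is a minimal sketch under that reading. The helper name top_k_sample is hypothetical, and production decoders usually combine this with temperature and other filters.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring logits."""
    top_idx = np.argsort(logits)[-k:]               # ids of the k largest logits
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())   # softmax over the survivors only
    probs /= probs.sum()
    return rng.choice(top_idx, p=probs)

logits = np.array([1.2, -0.5, 3.8, 2.1, 0.5])
rng = np.random.default_rng(0)  # seed is arbitrary
samples = [int(top_k_sample(logits, 3, rng)) for _ in range(200)]
print(sorted(set(samples)))  # only ids from the top 3 can appear
```

Tokens outside the top K get probability zero, so ids 1 and 4 can never be sampled with k=3 for these logits.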


# The Importance of np.where
In Transformer models, the np.where function is your primary tool for Masked Self-Attention. By setting illegal future tokens to -np.inf, you ensure that the Softmax function assigns them a probability of exactly $0$, preventing the model from "cheating" by looking at the answer before it's supposed to.
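The claim that masked positions end up with probability exactly $0$ can be checked directly. This is a small sketch with random scores and a 3x3 causal mask; the variable names and seed are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary
scores = rng.standard_normal((3, 3))

# Causal mask: position i may attend to positions 0..i only.
mask = np.tril(np.ones((3, 3), dtype=bool))
masked = np.where(mask, scores, -np.inf)

# Row-wise softmax; the diagonal is never masked, so each row max is finite.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted)  # exp(-inf) evaluates to exactly 0.0
weights /= weights.sum(axis=-1, keepdims=True)

print(weights)
```

Every entry above the diagonal is 0.0 and each row still sums to 1, so the future tokens contribute nothing to the weighted sum.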



# Comparison: np.maximum vs. np.max

This is a very common point of confusion:

np.max(x): Returns the single highest value in the whole array (e.g., $4.0$).

np.maximum(0, x): Returns a new array of the same shape, where every negative value is "clipped" to zero.
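A side-by-side run makes the difference concrete. The input values below are made up for illustration.

```python
import numpy as np

x = np.array([[-1.0, 4.0],
              [2.5, -3.0]])

overall_max = np.max(x)           # reduction: one scalar for the whole array
column_max = np.max(x, axis=0)    # reduction along an axis: one value per column
relu = np.maximum(0, x)           # element-wise: same shape as x, negatives become 0

print(overall_max)
print(column_max)
print(relu)
```

np.max reduces (the output is smaller than the input), while np.maximum compares two inputs element by element and keeps the shape.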

# Section 5 - Random Sampling
Here, we dive into Random Sampling. This is what gives LLMs their "creativity." Without these functions, a model would be a purely deterministic machine, always picking the single most likely word.

By using random sampling combined with Temperature, we can make the model's output varied and human-like.
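Temperature is not demonstrated in the cell that follows, so here is a minimal sketch of one common formulation: divide the logits by a temperature before the softmax. The helper name sample_with_temperature and the example values are assumptions for illustration.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Return (sampled index, probabilities) after temperature scaling."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs), probs

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(42)  # seed is arbitrary

_, cold = sample_with_temperature(logits, 0.5, rng)  # sharper distribution
_, hot = sample_with_temperature(logits, 2.0, rng)   # flatter distribution

print("T=0.5:", cold)
print("T=2.0:", hot)
```

A temperature below 1 concentrates probability on the top token (closer to argmax), while a temperature above 1 spreads it out and makes unlikely tokens more probable.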

In [101]:
import numpy as np

# Seed for reproducibility - vital for debugging!
np.random.seed(42)

print("--- 5. Random Sampling ---")

# 1. np.random.randn(d0, d1): Standard Normal distribution (mean=0, var=1).
# Commonly used for initializing weight matrices (e.g., Xavier or He init).
weights = np.random.randn(4, 4)
print(weights)
print(f"1. Random Weight Matrix (Normal):\n{weights[:2, :2]}...\n")

# 2. np.random.rand(d0, d1): Uniform distribution [0, 1).
# Often used for Dropout layers (generating a mask to drop neurons).
dropout_mask = np.random.rand(1, 5) > 0.1  # 10% dropout
print(f"2. Dropout Mask (90% True):\n{dropout_mask}\n")

# 3. np.random.randint(low, high, size): Random integers.
# Used for sampling random token IDs from a vocabulary or selecting random batches.
random_token_ids = np.random.randint(0, 50000, size=(10,))
print(f"3. Randomly Sampled Token IDs:\n{random_token_ids}\n")

# 4. np.random.choice(a, size, p): Weighted random sampling.
# The core of "Stochastic Decoding." p is the probability distribution from Softmax.
vocab = ['The', 'Cat', 'Sat', 'Mat']
probs = [0.1, 0.5, 0.2, 0.2]  # Must sum to 1.0
next_word = np.random.choice(vocab, size=1, p=probs)
print(f"4. Sampled Word based on probabilities: {next_word}\n")

# 5. Adding "Gumbel" Noise (Advanced):
# Gumbel-Max trick: argmax(logits + Gumbel noise) draws one sample from softmax(logits).
logits = np.array([2.0, 1.0, 0.1])
noise = np.random.gumbel(0, 1, logits.shape)
print(f"This is the noise: {noise}")
sample_with_noise = np.argmax(logits + noise)
print(f"5. Sampled index with Gumbel noise: {sample_with_noise}")

--- 5. Random Sampling ---
[[ 0.49671415 -0.1382643   0.64768854  1.52302986]
 [-0.23415337 -0.23413696  1.57921282  0.76743473]
 [-0.46947439  0.54256004 -0.46341769 -0.46572975]
 [ 0.24196227 -1.91328024 -1.72491783 -0.56228753]]
1. Random Weight Matrix (Normal):
[[ 0.49671415 -0.1382643 ]
 [-0.23415337 -0.23413696]]...

2. Dropout Mask (90% True):
[[ True  True  True  True  True]]

3. Randomly Sampled Token IDs:
[19118 35773  1899  1267 31551 11394  3556  3890 41606 30740]

4. Sampled Word based on probabilities: ['Cat']

This is the noise: [ 4.31595979 -1.04759856  0.18816036]
5. Sampled index with Gumbel noise: 0


In NumPy, using a comparison operator like the greater-than sign (>) on an array performs an element-wise comparison, with the scalar broadcast against every element.
Instead of returning a single True or False, it performs the check on every individual element and returns a new array of the same shape filled with Booleans.

1. How it works (Step-by-Step)
Generation: np.random.rand(1, 5) creates 5 random numbers in $[0, 1)$ (e.g., [0.05, 0.82, 0.12, 0.01, 0.95]).
Comparison: The > 0.1 is applied to each number.
Result:
$0.05 > 0.1 \rightarrow$ False
$0.82 > 0.1 \rightarrow$ True
$0.12 > 0.1 \rightarrow$ True
$0.01 > 0.1 \rightarrow$ False
$0.95 > 0.1 \rightarrow$ True

2. Why this represents "Dropout"
In a Neural Network, Dropout is a technique where you randomly "kill" neurons during training to prevent the model from becoming too reliant on specific connections.
The Threshold (0.1): Since rand() is uniform on $[0, 1)$, there is a $10\%$ chance a number will be less than $0.1$.
The Mask: By using >, we create a filter where, on average, $90\%$ of the values are True (Keep) and $10\%$ are False (Drop). In a small sample the split can differ: the 1x5 mask printed above happens to be all True.
The Application: In a real model, you would multiply a layer's activations by this mask. Because True acts like $1$ and False acts like $0$, the neurons associated with False values effectively disappear for that training step. Implementations commonly also divide the kept activations by the keep probability ($0.9$ here) so the expected magnitude stays the same.

3. Quick Pro-Tip: Memory
Boolean masks are compact: NumPy stores each Boolean in one byte. Converting with .astype(int) turns True/False into 1/0, which is handy for arithmetic, but it uses more memory (typically 8 bytes per element), so convert only when you actually need integers.
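Here is a short sketch of how such a mask is typically applied, assuming the "inverted dropout" convention in which kept activations are rescaled by the keep probability. The activations are all ones so the effect on the mean is easy to read; names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary
p_drop = 0.1
activations = np.ones((1000, 64))  # placeholder layer output

keep_mask = rng.random(activations.shape) > p_drop   # True with probability 0.9
dropped = activations * keep_mask / (1.0 - p_drop)   # zero out, then rescale

print(keep_mask.mean())  # fraction kept, close to 0.9
print(dropped.mean())    # close to 1.0 thanks to the rescaling

# Memory: one byte per Boolean versus eight bytes per int64.
print(keep_mask.nbytes, keep_mask.astype(np.int64).nbytes)
```

At inference time the mask is simply skipped; because of the rescaling during training, no extra correction is needed then.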

# Computer Vision (CV)
NumPy's role shifts slightly in Computer Vision. While LLMs are about sequences and logic, CV is about spatial geometry and channel manipulation. You'll be dealing with 4D tensors: $(Batch, Height, Width, Channels)$.

7. Computer Vision Specific Operations
These functions are essential for image augmentation, color-space shifts, and coordinate transformations.

In [103]:
import numpy as np

# Sample Image: 1 batch, 64x64 pixels, 3 color channels (RGB)
# Shape: (1, 64, 64, 3)
image = np.random.randint(0, 256, (1, 64, 64, 3), dtype=np.uint8)  # high is exclusive, so 256 includes 255

print("--- 7. Computer Vision Research Operations ---")

# 1. np.flip(a, axis): Reverse array elements.
# Used for "Horizontal Flip" data augmentation.
flipped_img = np.flip(image, axis=2) # axis 2 is Width
print(f"1. Flipped image shape: {flipped_img.shape}\n")

# 2. np.pad(array, pad_width, mode): Padding.
# Critical for constant-size inputs or preparing for convolutions.
padded_img = np.pad(image, ((0,0), (2,2), (2,2), (0,0)), mode='constant')
print(f"2. Padded image (64x64 -> 68x68): {padded_img.shape}\n")

# 3. np.roll(a, shift, axis): Cyclic shift.
# Used in "Shift-Invariant" testing or certain Augmentation techniques.
shifted_img = np.roll(image, shift=10, axis=1)
print(f"3. image shifted 10 pixels down. {shifted_img}\n")

# 4. np.clip(a, a_min, a_max): Limit values.
# Essential after brightness adjustments to keep pixels in [0, 255] range.
bright_img = image.astype(float) + 50
cleaned_img = np.clip(bright_img, 0, 255).astype(np.uint8)
print(f"4. Max value after clipping: {np.max(cleaned_img)}\n")

# 5. np.tile(a, reps): Construct an array by repeating.
# Used to repeat a single-channel mask across 3 RGB channels.
mask = np.random.rand(64, 64, 1) > 0.5
triple_mask = np.tile(mask, (1, 1, 3))
print(f"5. Tiled mask shape: {triple_mask.shape}\n")

# 6. np.moveaxis(a, source, destination): Move axes to specific positions.
# Crucial for switching between (H, W, C) and (C, H, W) for PyTorch compatibility.
pytorch_format = np.moveaxis(image, -1, 1) # Moves channels to index 1
print(f"6. PyTorch format (B, C, H, W): {pytorch_format.shape}\n")

# 7. np.meshgrid(x, y): Coordinate grids.
# Used for creating spatial positional encodings or sampling grids.
x = np.linspace(0, 1, 64)
y = np.linspace(0, 1, 64)
xv, yv = np.meshgrid(x, y)
print(f"7. Meshgrid shape (for coordinate-based models): {xv.shape}\n")

--- 7. Computer Vision Research Operations ---
1. Flipped image shape: (1, 64, 64, 3)

2. Padded image (64x64 -> 68x68): (1, 68, 68, 3)

3. image shifted 10 pixels down. [[[[198 160 111]
   [156  88 223]
   [ 42 222 250]
   ...
   [ 48 144 122]
   [ 88 230 149]
   [ 34 210 110]]

  [[ 26  27 159]
   [ 64 245 128]
   [ 60 218 194]
   ...
   [190 212 167]
   [245  50  44]
   [215 219 136]]

  [[237 235 202]
   [140   2  81]
   [179 167 214]
   ...
   [234 193 186]
   [ 95  53 246]
   [200 176  18]]

  ...

  [[222 224 179]
   [164   2   0]
   [ 74 195 194]
   ...
   [167  72 140]
   [ 45 103   3]
   [ 78 218  97]]

  [[ 75 244  46]
   [166  72 127]
   [ 35  72  76]
   ...
   [193 185 178]
   [192 208 170]
   [ 49  62 223]]

  [[ 52  73 121]
   [ 42 160 138]
   [232   7 169]
   ...
   [245 244 238]
   [139 171  67]
   [226  74  74]]]]

4. Max value after clipping: 255

5. Tiled mask shape: (64, 64, 3)

6. PyTorch format (B, C, H, W): (1, 3, 64, 64)

7. Meshgrid shape (for coordinate-based

To understand np.flip, think of it as a mirror reflection. When you flip an image, you aren't moving the pixels to random places; you are just reversing the order in which they appear along a specific direction.

The "Mirror" Example
Imagine a tiny $2 \times 3$ grayscale image (2 pixels high, 3 pixels wide). Each number is a pixel brightness.

import numpy as np

# A 2D "image" (Height=2, Width=3)
img = np.array([[10, 20, 30],
                [40, 50, 60]])

1. Horizontal Flip (axis=1)
If we flip along the Width (the columns), it's like looking in a mirror. The first pixel in a row becomes the last.

horiz_flip = np.flip(img, axis=1)
# Result:
# [[30, 20, 10],
#  [60, 50, 40]]

2. Vertical Flip (axis=0)
If we flip along the Height (the rows), it's like a reflection in a lake. The top row moves to the bottom.

vert_flip = np.flip(img, axis=0)
# Result:
# [[40, 50, 60],
#  [10, 20, 30]]

Why does the code say axis=2?
In Computer Vision research, images are usually 4D tensors: $(Batch, Height, Width, Channels)$
Axis 0: Batch (which image in the group)
Axis 1: Height (up and down)
Axis 2: Width (left and right)
Axis 3: Channels (Red, Green, Blue)

When the code uses np.flip(image, axis=2), it is telling NumPy: "Go to the Width dimension and reverse the order of the pixels." This creates a Horizontal Flip, which is one of the most common data augmentation techniques because a cat is still a cat whether it's looking left or right!

A Quick Warning
Be careful not to flip axis=3 (Channels). If you do, the channel order is reversed (RGB becomes BGR), so your Red values become Blue values, and your "Horizontal Flip" accidentally turns into a "Color Space Swap," which might confuse your model!
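In augmentation pipelines the flip is usually applied to a random subset of the batch rather than to every image. The sketch below shows one way to do that with np.where and a per-image Boolean; the batch size, image size, and 50% flip probability are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary
batch = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)  # (B, H, W, C)

flipped = np.flip(batch, axis=2)  # mirror every image along Width

# Decide per image whether to use the flipped version.
flip_these = rng.random(4) < 0.5
augmented = np.where(flip_these[:, None, None, None], flipped, batch)

print(flip_these)
print(augmented.shape)
```

Reshaping the decision vector to (B, 1, 1, 1) lets it broadcast over Height, Width, and Channels, so each image is taken whole from either the flipped or the original batch.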

# Padding in Computer Vision,
Padding is like adding a "border" around an image. This is usually done so that when a Convolutional Neural Network (CNN) slides a filter over the image, the edges don't get "shrunk" or lost.Here is the breakdown of the three key arguments in your code:1. array (The Image)This is your input data. In your example, it is a 4D tensor with the shape (Batch, Height, Width, Channels).

2. pad_width (The "How Much" Argument)This is often the most confusing part. It is a tuple of tuples, where each inner tuple corresponds to one dimension (axis) of your array.Format: ((before_axis0, after_axis0), (before_axis1, after_axis1), ...)In your code: ((0,0), (2,2), (2,2), (0,0))(0,0) (Batch): Add 0 pixels to the start and end. (Don't add more images).(2,2) (Height): Add 2 rows of pixels to the top and 2 to the bottom.(2,2) (Width): Add 2 columns of pixels to the left and 2 to the right.(0,0) (Channels): Add 0 to the color channels. (Don't add extra colors).Result: Since you added 2 to each side of the Height and Width, a $64 \times 64$ image becomes $68 \times 68$ ($64 + 2 + 2 = 68$).

3. mode='constant' (The "What Color" Argument)

This defines what values the new "border" pixels should have.

constant: Fills the border with a constant value (default is 0, or "black").

reflect: Mirrors the pixels from inside the image (great for keeping textures natural).

edge: Stretches the very last pixel of the image outward.

Why this is "Critical" for Research

Maintaining Resolution: Without padding, a $3 \times 3$ convolution filter would turn your $64 \times 64$ image into a $62 \times 62$ image. After many layers, your feature map would shrink to almost nothing!

Edge Coverage: Padding ensures the model "sees" the pixels at the very edge of the frame about as many times as it sees the pixels in the center.

Pro-Tip: Adding a Specific Color

If you want to pad with white pixels (value 255) instead of black, you add a constant_values argument:

padded = np.pad(image, ((0,0), (2,2), (2,2), (0,0)),
                mode='constant', constant_values=255)
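The three modes are easiest to compare on a single short row. The sketch below also includes `conv_output_size`, a hypothetical helper (not a NumPy function) that assumes a stride-1 convolution, to reproduce the 64 to 62 shrinkage mentioned above.

```python
import numpy as np

row = np.array([1, 2, 3])

constant = np.pad(row, 2, mode='constant')  # border filled with 0
reflect = np.pad(row, 2, mode='reflect')    # mirrored around the edge pixel
edge = np.pad(row, 2, mode='edge')          # edge pixel stretched outward
white = np.pad(row, 2, mode='constant', constant_values=255)

print(constant.tolist())  # [0, 0, 1, 2, 3, 0, 0]
print(reflect.tolist())   # [3, 2, 1, 2, 3, 2, 1]
print(edge.tolist())      # [1, 1, 1, 2, 3, 3, 3]
print(white.tolist())     # [255, 255, 1, 2, 3, 255, 255]


def conv_output_size(size, kernel, pad):
    """Hypothetical helper: output length of a stride-1 convolution."""
    return size + 2 * pad - kernel + 1


print(conv_output_size(64, 3, 0))  # 62: no padding, the image shrinks
print(conv_output_size(64, 3, 1))  # 64: one pixel of padding keeps the size
```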

# np.roll is a circular shift operation.

Unlike padding (which adds new pixels) or cropping (which removes them), rolling takes pixels that fall off one edge and "wraps" them back around to the opposite edge.

Think of it like a rotating conveyor belt or a globe: if you keep walking East, you eventually end up back where you started.

The Basic Example
Imagine a 1D sequence of numbers: [10, 20, 30, 40, 50]

1. Shift by 1 (shift=1)
Everything moves right. The last element (50) wraps around to become the first.

Result: [50, 10, 20, 30, 40]

2. Shift by -1 (shift=-1)
Everything moves left. The first element (10) wraps around to become the last.

Result: [20, 30, 40, 50, 10]
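Both shifts can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

seq = np.array([10, 20, 30, 40, 50])

right = np.roll(seq, 1)    # last element wraps to the front
left = np.roll(seq, -1)    # first element wraps to the back

print(right.tolist())  # [50, 10, 20, 30, 40]
print(left.tolist())   # [20, 30, 40, 50, 10]
```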

Breaking Down your Code
shifted_img = np.roll(image, shift=10, axis=1)

image: Your input tensor. Here it is assumed to be a 4D batch of shape (Batch, Height, Width, Channels), as in the flip and padding examples; for a single (Height, Width, Channels) image, Height would be axis=0 instead.

shift=10: Move the content 10 units forward.

axis=1: Apply this to the vertical axis (Height).

What happens to the image?
The whole image slides down by 10 rows. The bottom 10 rows are pushed off the bottom edge and reappear at the top. If your image was a picture of a person, their feet might disappear from the bottom of the frame and pop up above their head.
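To see which rows wrap, it helps to label each row with its own index. This is a sketch under the assumption that the image is a 4D batch (Batch, Height, Width, Channels), so axis=1 is Height; the tiny 4x2 image and a shift of 1 are made up for readability.

```python
import numpy as np

# Hypothetical batch: 1 image, 4 rows, 2 columns, 1 channel.
# Every pixel in row r holds the value r.
image = np.repeat(np.arange(4), 2).reshape(1, 4, 2, 1)

shifted = np.roll(image, shift=1, axis=1)  # move content down by one row

print(image[0, :, 0, 0].tolist())    # [0, 1, 2, 3]
print(shifted[0, :, 0, 0].tolist())  # [3, 0, 1, 2]
```

With a positive shift the last row (3) reappears at the top; a negative shift would send the top row to the bottom instead.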

Why use this in AI Research?
Shift-Invariance Testing: You want to ensure your model recognizes a "Cat" whether it's in the center or shifted slightly to the edge.

Symmetry in Data: In physics-based ML (like weather patterns), data often "wraps" around the globe, making np.roll a mathematically accurate way to augment that data.

Cyclic Features: Useful for time-series data that repeats, like 24-hour temperature cycles.

# 1. The "Wrap-around" (Overflow) Problem

Most images are stored as 8-bit unsigned integers (uint8). This format only has 256 possible values ($0$ to $255$).

If you perform math on uint8 and the result exceeds $255$, it doesn't stop at $255$; it wraps around back to $0$.

Example: 250 + 10 in uint8 doesn't equal 260. It equals 4 ($260 - 256$).

Visual Result: A bright white sky in your photo would suddenly turn nearly black because the values "tripped" over the maximum limit.

By converting to float, you move into a number system that can handle values like $260$, $300$, or $10,000$ without breaking. This lets you do your math safely before "clipping" it back down to the $0-255$ range.
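The wrap-around is easy to reproduce. This is a minimal sketch with two made-up pixel values; `sky`, `wrapped`, and `safe` are illustrative names.

```python
import numpy as np

sky = np.array([250, 100], dtype=np.uint8)

wrapped = sky + np.uint8(10)          # stays uint8: 260 wraps around to 4
safe = sky.astype(np.float32) + 10    # float32 can hold 260.0

print(wrapped.tolist())  # [4, 110]
print(safe.tolist())     # [260.0, 110.0]
```

NumPy performs the array addition silently, so the bug only shows up when you look at the image.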
# 2. The Arguments in np.clip
np.clip forces every number in an array to stay within a specific boundary.

Syntax: np.clip(a, a_min, a_max)

a (The Array): The input data you want to restrict (e.g., your brightened float image).

a_min (The Floor): The minimum allowed value. Anything smaller than this becomes this value. In images: We usually set this to 0.

a_max (The Ceiling): The maximum allowed value. Anything larger than this becomes this value. In images: We usually set this to 255.

Basic Logic Example

Imagine an array of "broken" image values after adding brightness:

pixels = [10, 280, -5, 125]
# Force values to stay between 0 and 255
cleaned = np.clip(pixels, 0, 255)

What happens inside:

10 stays 10 (It's within bounds).

280 becomes 255 (It hit the ceiling).

-5 becomes 0 (It hit the floor).

125 stays 125.

Why is np.clip better than just converting back?

If you simply converted 280.0 back to uint8 without clipping, you would get that "wrap-around" error (the pixel would typically turn into 24). By clipping first, you ensure that "super bright" pixels stay "pure white" ($255$) instead of turning into random dark colors.
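Putting the two ideas together gives the usual float, clip, convert-back pattern. This is a sketch, not a definitive implementation: `brighten` is a hypothetical helper name, and the 2x2 image is made up.

```python
import numpy as np


def brighten(img_uint8, amount):
    """Hypothetical helper: add `amount` to every pixel without wrap-around."""
    as_float = img_uint8.astype(np.float32) + amount  # do the math in float
    clipped = np.clip(as_float, 0, 255)               # keep values in [0, 255]
    return clipped.astype(np.uint8)                   # back to the image dtype


img = np.array([[10, 250],
                [125, 0]], dtype=np.uint8)

brighter = brighten(img, 30)   # 250 + 30 is capped at 255
darker = brighten(img, -20)    # 10 - 20 and 0 - 20 are capped at 0

print(brighter.tolist())  # [[40, 255], [155, 30]]
print(darker.tolist())    # [[0, 230], [105, 0]]
```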

# np.tile:
np.tile and np.repeat both handle creating a batch of images.

Suppose you have one grayscale image $A$ of shape $(2, 2)$ and you want to create a batch of 3 identical images.

1. The np.repeat approach

np.repeat works at the element level. If you repeat along the first axis (rows), it duplicates each row before moving to the next.

import numpy as np
a = np.array([[1, 2],
              [3, 4]])

# Repeat rows 3 times
repeated = np.repeat(a, 3, axis=0)
Result Shape: $(6, 2)$

Result: Each row is duplicated 3 times in a row.

[[1, 2],
 [1, 2],
 [1, 2],
 [3, 4],
 [3, 4],
 [3, 4]]
2. The np.tile approach

np.tile works at the structure level. It treats the entire array as a single "tile" and lays it down repeatedly.

# Tile the whole block 3 times vertically
tiled = np.tile(a, (3, 1))
Result Shape: $(6, 2)$

Result: The entire image pattern repeats 3 times.

[[1, 2],
 [3, 4],
 [1, 2],
 [3, 4],
 [1, 2],
 [3, 4]]
Summary Table: Creating Identical Batches

| Goal | Best Function | Reason |
| --- | --- | --- |
| Duplicate a mask for RGB | np.tile | You want the whole "mask pattern" copied for each color channel. |
| Magnify an image (Zoom) | np.repeat | You want to duplicate each individual pixel to make it look "bigger" (Nearest-Neighbor scaling). |
| Create a batch of 32 identical samples | np.tile | You want 32 complete copies of the image data, one after another. |
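Both examples above produce a stacked $(6, 2)$ array. To get a real batch axis, shape $(3, 2, 2)$, one option is to tile along a new leading dimension. The sketch below shows that, plus the nearest-neighbor zoom from the table; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

# Batch of 3 complete copies: reps has one more entry than a.ndim,
# so np.tile adds a new leading axis.
batch = np.tile(a, (3, 1, 1))
print(batch.shape)  # (3, 2, 2)

# Nearest-neighbor 2x zoom: repeat each pixel along rows, then columns.
zoomed = np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)
print(zoomed.tolist())
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```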

Final NumPy Performance Tip

If you are doing this to prepare data for a model, consider using np.broadcast_to instead:

# Does not use extra RAM; just "pretends" the array is bigger
batched = np.broadcast_to(a, (3, 2, 2))

This is significantly faster and uses less memory because it doesn't actually copy the data; it just changes the metadata (strides) of the array. The result is a read-only view, so call .copy() on it if you need to modify the batch.
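The "no copy" claim can be inspected through the array's strides and flags. This is a small sketch; `writable` is an illustrative name for the explicit copy.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

batched = np.broadcast_to(a, (3, 2, 2))

print(batched.shape)                 # (3, 2, 2)
print(batched.strides[0])            # 0: moving along the batch axis moves 0 bytes
print(np.shares_memory(a, batched))  # True: same underlying buffer
print(batched.flags.writeable)       # False: the view is read-only

# Make a real copy only when the batch must be modified.
writable = batched.copy()
writable[0, 0, 0] = 99
print(a[0, 0])                       # 1: the original is untouched
```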

# Important Note:
np.moveaxis is the CV "Secret Weapon"
In Computer Vision, libraries never agree on shape.

OpenCV & TensorFlow: Expect (Height, Width, Channels).

PyTorch: Expect (Channels, Height, Width).

If you try to feed an image into a PyTorch model without using np.moveaxis (or np.transpose), your model will treat the height of your image as the color channels, and you will most likely get a shape-mismatch error.
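The conversion itself is one line. This is a minimal sketch, assuming a made-up 480x640 RGB image and a batch of 8; only NumPy is used, so no PyTorch call is shown.

```python
import numpy as np

# Single image in (Height, Width, Channels) layout.
hwc = np.zeros((480, 640, 3), dtype=np.uint8)

chw = np.moveaxis(hwc, -1, 0)        # move the last axis to the front
same = np.transpose(hwc, (2, 0, 1))  # equivalent, with an explicit axis order

print(chw.shape)   # (3, 480, 640)
print(same.shape)  # (3, 480, 640)

# Batched version: (B, H, W, C) -> (B, C, H, W).
batch_nhwc = np.zeros((8, 480, 640, 3), dtype=np.uint8)
batch_nchw = np.moveaxis(batch_nhwc, -1, 1)
print(batch_nchw.shape)  # (8, 3, 480, 640)
```

np.moveaxis only names the axis being moved, which tends to be less error-prone than writing out the full permutation for np.transpose.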

Final Area: Data Analysis & Stats
Functions for evaluating your models (both LLM and CV).

In [55]:
# 8. np.percentile(a, q): Compute the q-th percentile.
# Used to find the "95th percentile" of token lengths to set a max_length.
lengths = np.array([10, 15, 200, 12, 18, 500])
max_len = np.percentile(lengths, 95)
print(f"8. 95th Percentile length: {max_len}\n")

# 9. np.nan_to_num(x): Replace NaN with 0 and +/-inf with the largest/smallest finite float.
# Prevents a single 'broken' weight from crashing your whole training run.
corrupted_data = np.array([1.0, np.nan, np.inf])
clean_data = np.nan_to_num(corrupted_data)
print(f"9. Cleaned NaN/Inf: {clean_data}\n")

# 10. np.savez_compressed('file.npz', a=arr1, b=arr2):
# The professional way to save massive datasets efficiently.
np.savez_compressed('model_data.npz', weights=weights, images=image)
print("10. Data saved to compressed .npz format.")

8. 95th Percentile length: 425.0

9. Cleaned NaN/Inf: [1.00000000e+000 0.00000000e+000 1.79769313e+308]

10. Data saved to compressed .npz format.
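To read the archive back, np.load returns a dictionary-like object keyed by the names used when saving. This is a sketch with stand-in arrays: the `weights` and `image` values here are made up, and the file is written to a temporary directory so that nothing in the working folder is overwritten.

```python
import os
import tempfile

import numpy as np

# Stand-in data (assumed shapes; the notebook's real arrays are defined elsewhere).
weights = np.arange(16, dtype=np.float32).reshape(4, 4)
image = np.zeros((2, 8, 8, 3), dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), 'model_data.npz')
np.savez_compressed(path, weights=weights, images=image)

# The keyword names used when saving become the keys when loading.
with np.load(path) as data:
    names = sorted(data.files)
    restored_weights = data['weights']
    restored_images = data['images']

print(names)                                      # ['images', 'weights']
print(np.array_equal(restored_weights, weights))  # True
```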
