# **Session 1: NumPy**

---

## ✅ **NumPy Cheatsheet for ML Engineers**

### 📦 **1. Setup**

```python
import numpy as np
```

---

### 🔧 **2. Array Creation**

```python
np.array([1, 2, 3])            # From list or tuple
np.zeros((3, 4))               # 3x4 array of zeros
np.ones((2, 2))                # 2x2 array of ones
np.eye(3)                      # Identity matrix (3x3)
np.full((2, 3), 7)             # Filled with a constant
np.arange(0, 10, 2)            # [0, 2, 4, 6, 8]
np.linspace(0, 1, 5)           # [0. , 0.25, ..., 1.]
np.random.rand(2, 3)           # Uniform [0, 1)
np.random.randn(2, 3)          # Standard normal
np.random.randint(0, 10, (2, 3))  # Random ints in [0, 10)
```

---

### 📏 **3. Array Properties**

```python
a.shape        # Shape (rows, cols)
a.ndim         # Number of dimensions
a.size         # Total number of elements
a.dtype        # Data type
a.itemsize     # Size in bytes of each element
```

---

### ✂️ **4. Indexing & Slicing**

```python
a[0]           # First element
a[-1]          # Last element
a[1:4]         # Slice elements 1 to 3
a[:, 0]        # All rows, 1st column
a[1, :]        # 2nd row, all columns
a[::2]         # Step slicing
```

---

### 🧠 **5. Boolean & Fancy Indexing**

```python
a[a > 0]               # Filter positive values
a[(a > 1) & (a < 5)]   # Combine conditions
np.where(a > 0, 1, 0)  # Ternary condition
```

---

### 🔁 **6. Reshaping & Transforming**

```python
a.reshape(2, 3)        # New shape (same size)
a.ravel()              # Flatten (returns a view when possible)
a.flatten()            # Flatten (always returns a copy)
a.T                    # Transpose
a.squeeze()            # Remove singleton dims
np.expand_dims(a, axis=0)  # Add a dimension
```

---

### 🧮 **7. Math & Stats**

```python
np.mean(a)             # Mean
np.median(a)           # Median
np.std(a)              # Standard deviation
np.var(a)              # Variance
np.sum(a)              # Sum
np.min(a), np.max(a)   # Min, Max
np.argmax(a), np.argmin(a)  # Index of max/min
np.cumsum(a)           # Cumulative sum
np.diff(a)             # Discrete difference
```

---

### 🤹 **8. Arithmetic & Broadcasting**

```python
a + b                  # Elementwise addition
a - b
a * b
a / b
a ** 2                 # Square elements
np.exp(a)              # e^a
np.log(a)              # Natural log
```

---

### 🧮 **9. Linear Algebra**

```python
np.dot(a, b)           # Dot product
a @ b                  # Matrix multiplication
np.matmul(a, b)        # Same as @
np.linalg.inv(a)       # Inverse
np.linalg.det(a)       # Determinant
np.linalg.eig(a)       # Eigenvalues/vectors
np.trace(a)            # Sum of diagonal
np.linalg.norm(a)      # L2 norm (Frobenius norm for 2D input)
```

---

### 🧩 **10. Combining & Splitting**

```python
np.concatenate([a, b], axis=0)
np.vstack([a, b])      # Stack vertically
np.hstack([a, b])      # Stack horizontally
np.split(a, 2)
np.array_split(a, 3)
```

---

### 🎲 **11. Random Sampling**

```python
np.random.seed(42)     # Reproducibility
np.random.choice(a)    # Random element
np.random.shuffle(a)   # Shuffle in place (returns None)
```

---

### 🧼 **12. Missing Values & NaNs**

```python
np.isnan(a)            # Check for NaNs
np.nan_to_num(a)       # Replace NaNs with 0 (and infs with large finite values)
np.nanmean(a)          # Mean ignoring NaNs
```

---

### 📉 **13. Useful Utilities**

```python
np.unique(a)           # Unique values
np.sort(a)             # Sort array
np.argsort(a)          # Indices for sort
np.clip(a, a_min, a_max)  # Limit values to [a_min, a_max]
np.allclose(a, b)      # Compare arrays
```

---


# **NumPy Interview Problems**

#### **1. Find the Missing Number**

Given a NumPy array with `n` distinct integers from `1` to `n+1` with one number missing, find the missing number.

> ✅ Tests indexing, sum, and set operations

---

```python
import numpy as np

# First attempt:
def find_missing(nums: list):
    expected_sum = np.sum(np.arange(1, len(nums) + 2, 1))
    actual_sum = np.sum(nums)
    missing = expected_sum - actual_sum
    return int(missing)

# Doesn't handle edge cases (e.g., empty list) gracefully.
# np.sum() on a Python list works, but it converts the list to an array first,
# so it is overkill unless the data is already a NumPy array.

# Improved code:
def find_missing(nums):
    if not nums:
        return None
    n = len(nums) + 1
    expected_sum = n * (n + 1) // 2  # closed form for 1 + 2 + ... + n
    actual_sum = sum(nums)
    return expected_sum - actual_sum

print(find_missing([1, 2, 3, 4, 6, 7, 8]))
print(find_missing([]))
```
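
The note above mentions set operations, which the sum-based solutions do not use. A minimal sketch of that variant follows, assuming the same input contract (distinct integers from `1` to `n+1` with exactly one missing). The name `find_missing_setdiff` is introduced here for illustration only.

```python
import numpy as np

def find_missing_setdiff(nums):
    """Hypothetical helper: return the value in 1..n+1 that is absent from nums."""
    nums = np.asarray(nums, dtype=int)
    full_range = np.arange(1, nums.size + 2)
    # setdiff1d returns the sorted values of full_range that are not in nums
    missing = np.setdiff1d(full_range, nums)
    return int(missing[0]) if missing.size else None

print(find_missing_setdiff([1, 2, 3, 4, 6, 7, 8]))
```

The set difference involves a sort, so it is likely slower than the sum formula, but the `missing` array would also expose every absent value if more than one were missing.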


#### **2. Normalize a Feature Vector**

Given a NumPy array of shape `(n,)`, normalize it to have a mean of 0 and standard deviation of 1.

> ✅ Tests basic stats, broadcasting

---


```python
# First attempt:
def normalize(input_arr: list):
    mean = np.mean(input_arr)
    stddev = np.sqrt(np.var(input_arr))
    output = []
    for n in input_arr:
        output.append(int((n - mean) / stddev))
    return output

# int() in the normalization step truncates every z-score, losing precision.
# Should use vectorized NumPy operations for cleaner code.

# Improved code:
def normalize(input_arr):
    arr = np.asarray(input_arr, dtype=float)
    std = arr.std()
    # A constant vector has zero spread; return zeros instead of dividing by 0.
    return (arr - arr.mean()) / std if std != 0 else np.zeros_like(arr)

print(normalize([5, 10, 3]))
```


#### **3. Matrix Row with Max Sum**

Given a 2D array, return the index of the row with the maximum sum.

> ✅ Tests `axis`, `sum`, and `argmax`

---


```python
# Assumption: matrix is rectangular (every row has the same length)
# First attempt:
def max_index(matrix):
    row, col = len(matrix), len(matrix[0])
    output = []
    for i in range(row):
        row_sum = 0
        for j in range(col):
            row_sum += matrix[i][j]
        output.append(int(row_sum))
    return f"max index is: {int(np.argmax(output))} and the max value is {int(np.max(output))}"

# Input is a list of lists, not a NumPy array.
# Not efficient: np.sum(matrix, axis=1) would be better.
# Returns a formatted string instead of index/value separately.

# Improved code:
def max_index(matrix):
    matrix = np.array(matrix)
    row_sums = matrix.sum(axis=1)   # one sum per row
    idx = row_sums.argmax()
    return int(idx), int(row_sums[idx])

matrix = [[2, 3, 4], [4, 5, 6], [7, 8, 9]]
print(max_index(matrix))

matrix = [[2, 3, 4, 5], [3, 6, 7, 8], [9, 3, 2, 1], [3, 5, 2, 7], [2, 1, 2, 1]]
print(max_index(matrix))
```

#### **4. One-Hot Encoding**

Given an array of class labels like `[0, 2, 1, 3]`, convert to one-hot encoding using NumPy.

> ✅ Tests indexing and broadcasting

---


```python
# First attempt:
def one_hot_encoding(labels: list):
    row = len(labels)
    col = max(labels) + 1
    matrix = np.zeros((row, col))
    for l in range(len(labels)):
        matrix[l][labels[l]] = 1
    return matrix

# Improved code:
def one_hot_encoding(labels):
    labels = np.array(labels)
    n_classes = labels.max() + 1
    one_hot = np.zeros((len(labels), n_classes), dtype=int)
    # Fancy indexing: set column labels[i] in row i for every i at once
    one_hot[np.arange(len(labels)), labels] = 1
    return one_hot

print(one_hot_encoding(labels=[0, 2, 1, 3]))
print(one_hot_encoding(labels=[1, 1, 2, 5, 7, 4]))
```
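
A common alternative is to index the rows of an identity matrix with the labels. The sketch below assumes non-negative integer labels; `one_hot_eye` and its optional `n_classes` parameter are names introduced here for illustration, not part of the original solution.

```python
import numpy as np

def one_hot_eye(labels, n_classes=None):
    """Hypothetical helper: one-hot encode labels by indexing an identity matrix."""
    labels = np.asarray(labels, dtype=int)
    if n_classes is None:
        n_classes = int(labels.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(n_classes, dtype=int)[labels]

print(one_hot_eye([0, 2, 1, 3]))
```

Passing `n_classes` explicitly keeps the output width fixed when a batch happens not to contain the largest class.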

#### **5. Detect Outliers Using IQR**

Given a 1D array, detect all elements that are outliers using the IQR method.

> ✅ Tests quantiles, boolean indexing

---


**The Interquartile Range (IQR)** method is a statistical technique used to identify outliers in a dataset. It involves calculating the IQR, which is the range between the first quartile (Q1) and the third quartile (Q3), and then defining upper and lower fences based on 1.5 times the IQR. Data points falling outside these fences are considered outliers.

Here's a step-by-step breakdown:

1. Calculate the quartiles:
   - Sort the data in ascending order.
   - Find the median (Q2), which is the middle value.
   - Q1 is the median of the lower half of the data (excluding Q2).
   - Q3 is the median of the upper half of the data (excluding Q2).
2. Calculate the IQR:
   - IQR = Q3 - Q1.
3. Determine the outlier fences:
   - Lower Fence: Q1 - (1.5 * IQR).
   - Upper Fence: Q3 + (1.5 * IQR).
4. Identify outliers:
   - Any data point below the lower fence or above the upper fence is considered an outlier.

Note that `np.quantile` uses linear interpolation by default, so its Q1 and Q3 can differ slightly from the median-of-halves values described above.

```python
def find_outliers(arr):
    arr = np.asarray(arr)
    q1 = np.quantile(arr, 0.25)
    q3 = np.quantile(arr, 0.75)

    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Boolean mask marks the values outside the fences
    mask = (arr < lower_fence) | (arr > upper_fence)
    return arr[mask].tolist()

arr = [1, 5, 7, 8, 10, 12, 15, 18, 20, 22, 25, 28, 45]
print(find_outliers(arr))

arr = [-11, 3, 5, 7, 9, 3, 7, 4, 90]
print(find_outliers(arr))
```

#### **6. Implement a Softmax Function**

Implement the softmax activation function using NumPy.

> ✅ Tests exponentials, broadcasting, stability

---



```python
# First attempt:
def softmax(arr):
    denominator = np.sum([np.exp(i) for i in arr])
    activations = [float(np.exp(i) / denominator) for i in arr]
    return activations

# Uses np.exp directly, which can cause overflow for large numbers.
# Not vectorized.

# Improved code:
def softmax(arr):
    arr = np.array(arr)
    exp_values = np.exp(arr - np.max(arr))  # numerical stability
    return exp_values / exp_values.sum()

print(softmax([5, 7, 10]))
```
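
In ML code, softmax is usually applied to a batch of logit vectors rather than a single one. The sketch below extends the improved version to that case, assuming the classes lie along the last axis; `softmax_rows` is a name introduced here for illustration.

```python
import numpy as np

def softmax_rows(logits):
    """Hypothetical helper: softmax over the last axis of a batch of logits."""
    logits = np.asarray(logits, dtype=float)
    # keepdims=True keeps shape (..., 1) so the max broadcasts against each row
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / exp_values.sum(axis=-1, keepdims=True)

batch = np.array([[5.0, 7.0, 10.0],
                  [1000.0, 1001.0, 1002.0]])
probs = softmax_rows(batch)
print(probs.sum(axis=-1))  # each row sums to 1 (up to rounding)
```

The second row would overflow without the max subtraction, since `np.exp(1000)` is `inf`.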

#### **7. Flatten and Reconstruct a Matrix**

Flatten a 2D matrix into a 1D vector and reshape it back to original shape. Verify that reconstruction is correct.

> ✅ Tests `reshape`, `flatten`, and `array equality`

---


```python
# First attempt:
def reshape_array(arr):
    arr = np.array(arr)
    arr_flat = arr.ravel()
    arr_reshaped = arr_flat.reshape(len(arr), len(arr[0]))
    return arr_flat, arr_reshaped

# Hard-coding (len(arr), len(arr[0])) only works for 2D input; arr.shape is general.

# Improved code:
def reshape_array(arr):
    arr = np.array(arr)
    arr_flat = arr.ravel()
    arr_reshaped = arr_flat.reshape(arr.shape)
    return arr_flat, arr_reshaped

print(reshape_array(arr=[[1, 2, 3], [6, 7, 8]]))
```
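
The problem statement also asks to verify the reconstruction, which the functions above leave to the reader. One way to sketch that check, using `np.array_equal` on a standalone example (the variable names here are illustrative):

```python
import numpy as np

original = np.array([[1, 2, 3], [6, 7, 8]])
flat = original.ravel()                 # 1D view of the same data
rebuilt = flat.reshape(original.shape)  # back to the original shape

# array_equal checks both shape and element values
print(np.array_equal(original, rebuilt))
```

`np.array_equal` compares exactly; for floating-point results produced by arithmetic, `np.allclose` is usually the more forgiving choice.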


#### **8. Compute Pairwise Euclidean Distance**

Given two sets of vectors `A` and `B`, compute the full pairwise distance matrix without using loops.

> ✅ Tests broadcasting and vectorization

---

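```python
# First attempt:
def compute_distance(vec_a, vec_b):
    d = np.sqrt(np.power((vec_b - vec_a), 2))
    return d

a = np.array((1, 4, 5, 8))
b = np.array((3, 5, 7, 9))
print(compute_distance(a, b))

# Computes the element-wise absolute difference, not the Euclidean distance:
# the squared differences are never summed before taking the square root.

# Corrected code (scalar distance between two vectors):
def compute_distance(vec_a, vec_b):
    return np.linalg.norm(vec_b - vec_a)

print(compute_distance(a, b))

# Note: this returns one distance for a single pair of vectors. The full
# pairwise matrix between two sets of vectors needs broadcasting over both sets.
```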
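
The problem statement asks for the full pairwise distance matrix between two sets of vectors, computed without loops. A minimal broadcasting sketch follows, assuming `A` has shape `(m, d)` and `B` has shape `(n, d)`; the function name `pairwise_distances` is introduced here for illustration.

```python
import numpy as np

def pairwise_distances(A, B):
    """Hypothetical helper: Euclidean distance between every row of A and every row of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # (m, 1, d) - (1, n, d) broadcasts to (m, n, d)
    diff = A[:, np.newaxis, :] - B[np.newaxis, :, :]
    # Sum squared differences over the feature axis, then take the root
    return np.sqrt((diff ** 2).sum(axis=-1))

A = np.array([[0.0, 0.0], [3.0, 4.0]])
B = np.array([[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]])
D = pairwise_distances(A, B)
print(D.shape)  # (2, 3): one row per vector in A, one column per vector in B
```

This version materializes an intermediate array of shape `(m, n, d)`, so for large inputs a formulation based on the expansion of the squared norm is likely to use less memory.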


#### **9. Shuffle Rows Independently**

Given a 2D array, shuffle each row **independently**.

> ✅ Tests in-place modification and random logic

---

```python
# First attempt:
def shuffle_row_content(matrix):
    rows = len(matrix)
    cols = len(matrix[0])
    output = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(int(matrix[r][c]))
            np.random.shuffle(row)
        output.append(row)
    return output

# Logic is broken: it shuffles the row incrementally inside the column loop.
# np.random.shuffle(row) should happen outside the inner loop.

# Improved code:
def shuffle_row_content(matrix):
    matrix = np.array(matrix)
    # permutation() returns a shuffled copy, so the input is left untouched
    return np.array([np.random.permutation(row) for row in matrix])

arr_2 = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

print(shuffle_row_content(arr_2))
```
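
The improved version still loops over rows in Python. A loop-free sketch is possible with `Generator.permuted`, assuming NumPy 1.20 or newer; the seed value and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only to make the run repeatable
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# permuted() shuffles each slice along the given axis independently,
# so axis=1 reorders the entries within every row separately
shuffled = rng.permuted(matrix, axis=1)
print(shuffled)
```

`permuted` returns a new array and leaves `matrix` unchanged unless an `out` argument is supplied.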

#### **10. Apply Custom Function Across Columns**

Given a 2D array, apply a custom function (e.g., `range = max - min`) to each column.

> ✅ Tests `axis`, `apply_along_axis`

---

```python
# First attempt:
def custom_function(matrix):
    rows, cols = len(matrix), len(matrix[0])
    output = []
    for i in range(cols):
        col = matrix[:, i]
        output.append(float(np.max(col) - np.min(col)))
    return output

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Only works on NumPy arrays (matrix[:, i] fails on a list of lists); `rows` is unused.
# Better to vectorize using axis=0.

# Improved code:
def custom_function(matrix):
    matrix = np.array(matrix)
    return (matrix.max(axis=0) - matrix.min(axis=0)).tolist()

print(custom_function(arr))
```
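
The note above lists `apply_along_axis`, which neither solution uses. A minimal sketch of that route, assuming the custom function takes one 1D column at a time; `value_range` is a name introduced here for illustration.

```python
import numpy as np

def value_range(column):
    """Hypothetical per-column function: max minus min of a 1D array."""
    return column.max() - column.min()

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# axis=0 passes each column to value_range as a 1D array
per_column = np.apply_along_axis(value_range, 0, arr)
print(per_column)

# For this particular function NumPy has a built-in: peak-to-peak
print(np.ptp(arr, axis=0))
```

`np.apply_along_axis` is a convenience that still calls the Python function once per column, so the vectorized `max(axis=0) - min(axis=0)` form is likely to be faster when one exists.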

#### **11. Sum the Alphabet Values**
Given a list of strings made up of lowercase or uppercase letters from a to z, return a list of the alphabet sum of each word.

The alphabet sum is defined as the sum of the ordinal position of each letter in the English alphabet.
For example, a = 1, b = 2, ..., z = 26.
So "sport" has an alphabet sum of 19 + 16 + 15 + 18 + 20 = 88.

> ✅ Tests string handling, character mapping, and list comprehension



```python
# First attempt:
def sum_alphabet(words):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    alphabet_dict = {}
    for i in range(1, len(alphabet)+1):
        alphabet_dict[alphabet[i-1]] = i

    output = []
    for word in words:
        sum_value = 0
        word.strip().lower()
        for letter in word:
            if letter.lower() in alphabet_dict:
                sum_value += alphabet_dict[letter.lower()]
        output.append(sum_value)
    return output

# word.strip().lower() has no effect since the result is not reassigned.
# Minor inefficiency: letter.lower() is called repeatedly inside the loop.

# Improved code:
def sum_alphabet(words):
    alphabet_dict = {char: idx + 1 for idx, char in enumerate('abcdefghijklmnopqrstuvwxyz')}
    output = []
    for word in words:
        word = word.strip().lower()
        # Characters outside a-z contribute 0
        sum_value = sum(alphabet_dict.get(ch, 0) for ch in word)
        output.append(sum_value)
    return output

words = ["sport", "Good", "bAd", " ", "%cat"]
print(sum_alphabet(words))
```

# Mastering NumPy for ML Engineering: Tutorial for Going from Intermediate to Advanced

This tutorial covers the key areas to improve from intermediate to advanced-level proficiency in NumPy. Each section includes:

* **Concept explanation**
* **Why it matters**
* **Common pitfalls**
* **Best practices**
* **Examples**

---

## 1. Vectorization and Broadcasting

### 🔎 Concept:

Vectorization is replacing explicit Python loops with NumPy operations that run in compiled code for better performance. Broadcasting lets NumPy perform operations on arrays of different shapes.

### ⚠️ Why It Matters:

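* Moves loops into compiled code, which is often far faster than a Python-level loop
* Broadcasting avoids materializing repeated copies of the smaller array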
* Leads to cleaner, more readable code

### 🔧 Usage Tips:

* Use `np.where`, `np.sum`, `np.mean`, `np.dot`, etc. instead of manual loops.
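* Understand the shape rules of broadcasting: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.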

### 🔢 Example:

```python
arr = np.array([1, 2, 3, 4])

# BAD: Loop-based
output = []
for x in arr:
    output.append(x ** 2)

# GOOD: Vectorized
output = arr ** 2
```

#### Broadcasting:

```python
# Broadcasting a scalar
arr = np.array([1, 2, 3])
arr + 5  # [6, 7, 8]

# Subtract each row's mean: the (2, 1) column of means broadcasts across the columns
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row_mean = matrix.mean(axis=1, keepdims=True)
normalized = matrix - row_mean
```
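
To make the shape rules concrete, the sketch below compares a case that broadcasts with one that does not. The array values and variable names are illustrative.

```python
import numpy as np

matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])            # shape (2, 3)

col_mean = matrix.mean(axis=0)                  # shape (3,)
row_mean = matrix.mean(axis=1, keepdims=True)   # shape (2, 1)

centered_cols = matrix - col_mean   # (2, 3) - (3,)   -> (2, 3)
centered_rows = matrix - row_mean   # (2, 3) - (2, 1) -> (2, 3)

# Without keepdims the row means have shape (2,), which does not line up
# with the trailing dimension of size 3, so NumPy raises a ValueError.
try:
    matrix - matrix.mean(axis=1)
except ValueError as err:
    print("shape mismatch:", err)
```

Shapes are aligned from the right, which is why a `(3,)` vector lines up with the columns while a `(2,)` vector needs `keepdims=True` (or an explicit `[:, np.newaxis]`) to line up with the rows.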

---

## 2. Numerical Stability

### 🔎 Concept:

Certain mathematical operations (like softmax) can cause **overflow or underflow**. Overflow means a number is too large to be represented (e.g., `np.exp(1000)` → `inf`). Underflow means a number is too small (close to zero), and might be rounded down to `0.0`.

### ⚠️ Why It Matters:

Unstable code may give `inf`, `NaN`, or misleading values in real ML pipelines.

### 🔧 Best Practices:

* Subtract max before applying `exp` in softmax.
* Use log-space tricks (`logsumexp`, etc.)

### 🔢 Example:

```python
def stable_softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
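
The log-space trick mentioned above can be sketched in plain NumPy as follows. The function names `logsumexp_1d` and `log_softmax_1d` are introduced here for illustration and only handle 1D input; SciPy ships a more general `scipy.special.logsumexp`.

```python
import numpy as np

def logsumexp_1d(x):
    """Hypothetical helper: log(sum(exp(x))) computed without overflow."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    # exp(x - m) is at most 1, so the sum cannot overflow
    return m + np.log(np.sum(np.exp(x - m)))

def log_softmax_1d(x):
    """Hypothetical helper: log of the softmax, computed in log space."""
    x = np.asarray(x, dtype=float)
    return x - logsumexp_1d(x)

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp_1d(x))                  # finite, although np.exp(1000) is inf
print(np.exp(log_softmax_1d(x)).sum())  # probabilities sum to 1 (up to rounding)
```

Working with log-probabilities directly is usually preferable when the next step is a log-likelihood or cross-entropy, because it avoids taking the log of values that have underflowed to 0.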

---

## 3. Defensive Programming

### 🔎 Concept:

Write functions that don't crash on edge cases like empty arrays, NaNs, or zero-division.

### ⚠️ Why It Matters:

In production or research, you can't afford silent failures.

### 🔧 Best Practices:

* Check array length before division.
* Use `np.isnan`, `np.isinf`, `np.any`, `np.all` to inspect array health.
* Return default values where appropriate.

### 🔢 Example:

```python
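def normalize(x):
    x = np.asarray(x, dtype=float)
    # Empty or constant input has no spread to scale by; avoid dividing by 0.
    if x.size == 0 or x.std() == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / x.std()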
```
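
The health checks listed under best practices can be bundled into a small validation step. This is a sketch under the assumption that bad input should fail loudly; `check_array` and its error messages are hypothetical, not an established API.

```python
import numpy as np

def check_array(x, name="x"):
    """Hypothetical validator: raise early on empty, NaN, or infinite input."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        raise ValueError(f"{name} is empty")
    if np.isnan(x).any():
        raise ValueError(f"{name} contains NaN")
    if np.isinf(x).any():
        raise ValueError(f"{name} contains inf")
    return x

clean = check_array([1.0, 2.0, 3.0])
try:
    check_array([1.0, np.nan])
except ValueError as err:
    print(err)
```

Whether to raise, as here, or to return a default value depends on the caller; raising makes the failure visible instead of silent.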

---

## 4. Efficient Indexing and Masking

### 🔎 Concept:

Use Boolean indexing to select or manipulate data without loops. Boolean indexing means using a boolean array (of `True`/`False` values) to select elements of another array. The elements corresponding to `True` are kept.

### ⚠️ Why It Matters:

Masking makes filtering data simple and fast.

### 🔧 Usage Tips:

* Combine masks using `&`, `|` with parentheses.
* Use `np.where` for ternary conditions.

### 🔢 Example:

```python
arr = np.array([1, 5, 7, 10])
mask = arr > 5
arr[mask]  # [7, 10]

np.where(arr > 5, 1, 0)  # [0, 0, 1, 1]
```
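
Combining masks, as the usage tips suggest, looks like the sketch below. The parentheses matter because `&` and `|` bind more tightly than comparisons. The array values are illustrative.

```python
import numpy as np

arr = np.array([1, 5, 7, 10, -3])

in_range = (arr > 1) & (arr < 10)   # element-wise AND
extreme = (arr < 0) | (arr >= 10)   # element-wise OR

print(arr[in_range])    # values 5 and 7
print(arr[extreme])     # values 10 and -3
print(arr[~in_range])   # ~ negates a mask: values 1, 10 and -3

# Masks also work on the left-hand side of an assignment
clipped = arr.copy()
clipped[clipped < 0] = 0
print(clipped)
```

Python's `and` / `or` do not work element-wise on arrays and raise a `ValueError` for arrays with more than one element, which is why the bitwise operators are used.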

---

## 5. Array Manipulation Mastery

### 🔎 Concept:

Know when and how to reshape, flatten, expand or squeeze dimensions.

### ⚠️ Why It Matters:

ML input pipelines often require reshaped or batched data.

### 🔧 Tips:

* Use `.reshape()`, `.flatten()`, `.squeeze()`, `np.expand_dims()`
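* Avoid reshaping that breaks data order: `reshape` only reinterprets the elements in their existing (row-major) order, so use `.T` or `np.transpose` when the goal is to swap axes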

### 🔢 Example:

```python
x = np.array([[1, 2], [3, 4]])
x.reshape(-1)         # [1 2 3 4]
x.reshape(1, 4)        # [[1 2 3 4]]
x.flatten()           # [1 2 3 4] (copy)
x.ravel()             # view (if possible)
```
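
The tip about data order is easiest to see by comparing `reshape` with a transpose on the same array. A small illustrative sketch:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)

reshaped = x.reshape(3, 2)     # same element order, new row boundaries
transposed = x.T               # axes swapped, rows become columns

print(reshaped)    # rows: [1 2], [3 4], [5 6]
print(transposed)  # rows: [1 4], [2 5], [3 6]
```

Both results have shape `(3, 2)`, but only the transpose keeps each original column together, which is usually what is intended when switching between samples-by-features and features-by-samples layouts.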

---

## 6. Advanced Aggregations and Stats

### 🔎 Concept:

Use axis-based operations and aggregation functions effectively.

### ⚠️ Why It Matters:

Summarizing data across rows or columns is key for ML feature engineering.

### 🔧 Tips:

* Learn `np.sum`, `np.mean`, `np.std`, `np.median`, `np.quantile`, `np.argmax`, `np.bincount`
* Use `axis=` parameter wisely

### 🔢 Example:

```python
matrix = np.array([[1, 2, 3], [4, 5, 6]])
matrix.mean(axis=0)  # mean per column
matrix.sum(axis=1)   # sum per row
```
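
A few of the other functions from the tips, sketched on small illustrative arrays (the variable names are assumptions made for the example):

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

col_median = np.quantile(scores, 0.5, axis=0)  # per column: 2.5, 3.5, 4.5
best_col = scores.argmax(axis=1)               # per row: index 2, index 2

labels = np.array([0, 2, 1, 2, 2])
counts = np.bincount(labels)                   # occurrences of 0, 1, 2: 1, 1, 3

print(col_median, best_col, counts)
```

A useful way to read `axis=`: it names the dimension that disappears, so `axis=0` collapses the rows and leaves one value per column.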

---

## 7. Randomness and Reproducibility

### 🔎 Concept:

Use NumPy's `random` module for consistent random sampling.

### ⚠️ Why It Matters:

In experiments, reproducibility is essential.

### 🔧 Tips:

* Seed the generator before drawing random numbers (`np.random.seed()`, or a `np.random.default_rng(seed)` generator in newer code)
* Use `np.random.permutation`, `shuffle`, `choice`, `randn`, `randint`

### 🔢 Example:

```python
np.random.seed(42)
np.random.choice([1, 2, 3], size=2, replace=False)
```
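
Newer NumPy code often uses a local `Generator` object instead of the global seed, which keeps the random state explicit. A minimal sketch, assuming NumPy 1.17 or newer; the seed value and variable names are illustrative.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.choice([1, 2, 3], size=2, replace=False)
sample_b = rng_b.choice([1, 2, 3], size=2, replace=False)
print(np.array_equal(sample_a, sample_b))

# Generator counterparts of randn and randint
noise = rng_a.standard_normal(3)
ints = rng_a.integers(0, 10, size=(2, 3))   # upper bound is exclusive
```

Passing the generator into functions that need randomness makes each experiment reproducible without depending on global state set elsewhere in the program.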

---

## Final Tips:

* Practice refactoring your Python loops into vectorized NumPy equivalents.
* Write unit tests for your NumPy code.
* Benchmark performance using `%timeit` in IPython.
* Combine with Pandas and Scikit-learn for full ML workflows.
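
As an illustration of the unit-testing tip, the sketch below checks a `normalize` function with `np.testing`. The function is redefined here so the example is self-contained, and the test name and tolerances are assumptions.

```python
import numpy as np

def normalize(x):
    x = np.asarray(x, dtype=float)
    if x.size == 0 or x.std() == 0:
        return np.zeros_like(x)
    return (x - x.mean()) / x.std()

def test_normalize():
    z = normalize([5, 10, 3])
    # Floating-point results need a tolerance rather than exact equality
    np.testing.assert_allclose(z.mean(), 0.0, atol=1e-12)
    np.testing.assert_allclose(z.std(), 1.0)
    # Edge case: constant input should come back as zeros
    np.testing.assert_array_equal(normalize([4, 4, 4]), np.zeros(3))

test_normalize()
print("all checks passed")
```

A test runner such as pytest would typically collect a function named like `test_normalize` automatically; calling it directly, as here, also works in a notebook.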

---