# Table of Contents

1. Boolean Indexing (Advanced)
2. Fancy Indexing
3. Sorting & Searching
   - np.sort()
   - np.argsort()
   - np.where()
   - np.searchsorted()
4. Unique & Set Operations
   - np.unique()
   - return_counts
   - set operations
5. Advanced Broadcasting
6. Memory Layout & Efficiency
   - C-order vs F-order
   - ascontiguousarray()
7. Vectorization Techniques
8. Normalization using Broadcasting
9. Random Shuffling & Permutation (optional)
10. Summary


# 1. Boolean Indexing (Advanced)
### What is it?

Boolean indexing allows you to filter arrays using conditions, and even apply multiple conditions together.

### Why is it important?

Used heavily in:
- data cleaning
- selecting records based on conditions
- modifying datasets
- ML preprocessing

In [2]:
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
arr[arr > 25]

array([30, 40, 50])

#### Multiple Conditions

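Combine conditions with & (AND), | (OR), and ~ (NOT). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as > and <: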

In [4]:
arr[(arr > 15) & (arr < 45)]

array([20, 30, 40])
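The same pattern extends to `|` and `~`. The sketch below uses a separate illustrative array, `scores` (not part of the notebook above), so it does not depend on the current state of `arr`:

```python
import numpy as np

# Illustrative array, chosen only for this example
scores = np.array([10, 20, 30, 40, 50])

# OR: keep values below 25 or above 45
low_or_high = scores[(scores < 25) | (scores > 45)]

# NOT: invert a combined mask to keep everything outside the open interval (15, 45)
not_middle = scores[~((scores > 15) & (scores < 45))]

print(low_or_high)  # [10 20 50]
print(not_middle)   # [10 50]
```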

#### Replacing values using boolean masks

In [6]:
arr[arr > 30] = 999

In [7]:
arr

array([ 10,  20,  30, 999, 999])

#### Boolean mask array

In [8]:
mask = arr % 2 == 0
arr[mask]

array([10, 20, 30])

# 2. Fancy Indexing
### What is it?

Fancy indexing lets you select elements using arrays/lists of indices, not just slices.

### Why is it useful?

Used for:
- selecting arbitrary rows/columns
- reordering data
- selecting random samples for ML

#### Selecting specific elements

In [9]:
a = np.array([10, 20, 30, 40, 50])
a[[0, 2, 4]]

array([10, 30, 50])

#### Fancy indexing on 2D arrays

In [10]:
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
A[[0, 2]]    # select rows 0 and 2

array([[1, 2, 3],
       [7, 8, 9]])

#### Select specific columns

In [12]:
A[:, [0, 2]]    # select columns 0 and 2

array([[1, 3],
       [4, 6],
       [7, 9]])

#### Fancy indexing with rows & columns

In [13]:
A[[0, 2], [1, 2]]

array([2, 9])
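Passing two index lists pairs them element by element, so the result above picks `A[0, 1]` and `A[2, 2]` rather than a 2×2 block. If the block is what you want, `np.ix_` is one way to build it. A minimal sketch:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Paired indices: one element per (row, column) pair
paired = A[[0, 2], [1, 2]]

# Cross product of indices: rows 0 and 2 crossed with columns 1 and 2
block = A[np.ix_([0, 2], [1, 2])]

print(paired)  # [2 9]
print(block)
# [[2 3]
#  [8 9]]
```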

# 3. Sorting & Searching in NumPy

Sorting and searching are essential for data preprocessing, cleaning, ranking, indexing, and ML tasks.

#### 3.1 **np.sort() — Returns a Sorted Copy**

Sorts an array without modifying the original.

In [15]:
arr = np.array([40, 10, 50, 20, 30])
np.sort(arr)

array([10, 20, 30, 40, 50])

#### 3.2 **np.sort() on 2D arrays**

Sorts along the last axis by default, so each row of a 2D array is sorted independently.

In [16]:
A = np.array([[3, 2, 4], [9, 6 ,7]])
np.sort(A)

array([[2, 3, 4],
       [6, 7, 9]])
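The `axis` argument controls the sort direction. The sketch below uses a different illustrative array, `M`, so that sorting by row and by column give visibly different results:

```python
import numpy as np

# Illustrative array, chosen only for this example
M = np.array([[9, 2, 4], [3, 6, 1]])

by_row = np.sort(M, axis=1)     # each row sorted (same as the default axis=-1)
by_col = np.sort(M, axis=0)     # each column sorted
flat = np.sort(M, axis=None)    # flatten first, then sort

print(by_row)  # [[2 4 9]
               #  [1 3 6]]
print(by_col)  # [[3 2 1]
               #  [9 6 4]]
print(flat)    # [1 2 3 4 6 9]
```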

#### 3.3 **np.argsort() — Returns Sorted Index Positions**

Used often in ML for ranking values.

In [18]:
arr = np.array([50, 20, 10, 40])
np.argsort(arr)

array([2, 1, 3, 0], dtype=int64)

Meaning:
sorted order → values at indices 2, 1, 3, 0
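The index array returned by `np.argsort()` can be used as a fancy index. A short sketch of three common uses (the variable names are illustrative):

```python
import numpy as np

arr = np.array([50, 20, 10, 40])
order = np.argsort(arr)           # [2 1 3 0]

sorted_vals = arr[order]          # ascending order
descending = arr[order[::-1]]     # descending order
top2_idx = order[-2:][::-1]       # indices of the two largest values, largest first

print(sorted_vals)  # [10 20 40 50]
print(descending)   # [50 40 20 10]
print(top2_idx)     # [0 3]
```

The same `order` array can reorder a second, parallel array (for example, names that belong to the scores), which is why `argsort` is often preferred over `sort` for ranking.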

#### 3.4 **np.where() — Find Positions Matching a Condition**

In [19]:
arr = np.array([10, 20, 30, 40, 50])
np.where(arr > 30)

(array([3, 4], dtype=int64),)

Useful for:
- Finding row indices
- Replacing values
- Conditional operations
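With a single argument, `np.where()` returns a tuple holding one index array per dimension, which is why the 1D result above is wrapped in a tuple. A small sketch on a 2D array (`M` is an illustrative name):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
idx = np.where(arr > 30)[0]       # unpack the single index array for 1D input

# Illustrative 2D array, chosen only for this example
M = np.array([[5, 40], [35, 10]])
rows, cols = np.where(M > 30)     # one index array per axis

print(idx)            # [3 4]
print(rows, cols)     # [0 1] [1 0]
print(M[rows, cols])  # [40 35]
```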

#### 3.5 **np.searchsorted() — Find Insertion Position**
Useful in:
- Binary search
- Keeping arrays sorted

In [21]:
arr = np.array([10, 20, 30, 40])
np.searchsorted(arr, 25)

2

Means:
25 should be inserted at index 2 to keep the array sorted. The input array must already be sorted for the result to be meaningful.
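When the value already exists in the array, the `side` argument decides whether the insertion point falls before or after the existing entries. The sketch below also passes several values at once:

```python
import numpy as np

arr = np.array([10, 20, 30, 40])

left = np.searchsorted(arr, 20)                 # before the existing 20
right = np.searchsorted(arr, 20, side="right")  # after the existing 20
many = np.searchsorted(arr, [5, 25, 45])        # several lookups in one call

print(left)   # 1
print(right)  # 2
print(many)   # [0 2 4]
```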

# 4. Unique & Set Operations

These functions help you work with categorical data, remove duplicates, find frequencies, and compare datasets.

Extremely useful in:
- Data cleaning
- Feature engineering
- Finding unique categories
- Comparing two arrays

#### 4.1 **np.unique() — Get Unique Values**

In [22]:
arr = np.array([1, 2, 2, 3, 3, 3, 4])
np.unique(arr)

array([1, 2, 3, 4])

#### 4.2 **Count Occurrences of Each Value**

Use return_counts=True.

In [23]:
value, counts = np.unique(arr, return_counts = True)
print(value)
print(counts)

[1 2 3 4]
[1 2 3 1]


Very useful for:
- categorical analysis
- bar charts
- class distribution in ML
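For categorical data, the counts can be combined with `np.argmax()` to find the most frequent class, and `return_inverse=True` gives an integer code per element. A minimal sketch with made-up labels:

```python
import numpy as np

# Made-up labels, chosen only for this example
labels = np.array(["cat", "dog", "dog", "bird", "cat", "dog"])

values, counts = np.unique(labels, return_counts=True)
most_common = values[np.argmax(counts)]

# Integer codes: position of each label in the sorted unique values
values, codes = np.unique(labels, return_inverse=True)

print(values)       # ['bird' 'cat' 'dog']
print(counts)       # [1 2 3]
print(most_common)  # dog
print(codes)        # [1 2 2 0 1 2]
```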

#### 4.3 **np.intersect1d() — Intersection**

Find common elements between two arrays.

In [24]:
a = np.array([1, 2, 3])
b = np.array([3, 4, 5])

np.intersect1d(a, b)

array([3])

#### 4.4 **np.union1d() — Union**

All unique values from both arrays.

In [26]:
np.union1d(a, b)

array([1, 2, 3, 4, 5])

#### 4.5 **np.setdiff1d() — Difference**

Values in a that are not in b.

In [27]:
np.setdiff1d(a, b)

array([1, 2])

In [28]:
# values that are in b but not in a 
np.setdiff1d(b, a)

array([4, 5])

#### 4.6 **np.setxor1d() — Symmetric Difference**

Values that appear in exactly one of the two arrays.

In [30]:
np.setxor1d(a, b)

array([1, 2, 4, 5])
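The set functions above return sorted, de-duplicated results. When you need a mask that keeps the original order and duplicates, `np.isin()` is a related tool. A short sketch with illustrative arrays:

```python
import numpy as np

# Illustrative arrays, chosen only for this example
p = np.array([1, 2, 3, 2])
q = np.array([3, 4, 5, 2])

mask = np.isin(p, q)   # True where an element of p also appears in q

print(mask)            # [False  True  True  True]
print(p[mask])         # [2 3 2]
```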

# 5. Advanced Broadcasting

Broadcasting lets NumPy perform operations between arrays of different shapes by automatically expanding the smaller array.

You already learned the basics — now here are the important advanced cases used in Data Analysis & ML.

#### 5.1 **Broadcasting with Row Vector vs Column Vector**
- Row vector shape → (1, n)
- Column vector shape → (n, 1)

In [31]:
row = np.array([[1, 2, 3]])
col = np.array([[10], [20], [30]])
row+col

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

NumPy expands:
- row → downward
- column → rightward

#### 5.2 **Broadcasting for Normalization**
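Standardize values by subtracting the mean and dividing by the standard deviation. This first example uses a single 1D feature:

In [34]:
X = np.array([1, 2, 3])
X_norm = (X - X.mean(axis = 0)) / X.std(axis = 0)
X_norm

array([-1.22474487,  0.        ,  1.22474487])

- For this 1D X, X.mean(axis=0) is a scalar, broadcast to every element
- For a 2D dataset of shape (n_samples, n_features), X.mean(axis=0) has shape (n_features,)
- That vector is broadcast across all rows, normalizing each column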

This is used in:
- Linear regression
- Logistic regression
- Neural networks
- PCA

#### 5.3 **Broadcasting 1D → 2D or 3D**

Example:

In [35]:
a = np.array([1, 2, 3])
A = np.ones((4, 3))     # shape (4, 3)

A + a

array([[2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.],
       [2., 3., 4.]])

a is expanded along rows.
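The same rule covers the 3D case named in the heading, and a 1D array can be turned into a column with `[:, None]` when it should line up with rows instead. A minimal sketch (`B` and `col` are illustrative names):

```python
import numpy as np

a = np.array([1, 2, 3])

# 1D -> 3D: shape (3,) lines up with the last axis of (2, 4, 3)
B = np.zeros((2, 4, 3))
result_3d = B + a

# 1D as a column: shape (4,) -> (4, 1), broadcast across the 3 columns
col = np.array([10, 20, 30, 40])
result_col = np.ones((4, 3)) + col[:, None]

print(result_3d.shape)   # (2, 4, 3)
print(result_col[:, 0])  # [11. 21. 31. 41.]
```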

#### 5.4 **Broadcasting with Scalars (Always Allowed)**

In [37]:
a*10

array([10, 20, 30])

In [38]:
a + 5

array([6, 7, 8])

Scalar expands to all elements.

#### 5.5 **Broadcasting Fails When Shapes Are Incompatible**

Example:

In [39]:
a = np.array([1, 2, 3])
b = np.array([1, 2])

a + b    # error

ValueError: operands could not be broadcast together with shapes (3,) (2,) 

Broadcasting compares shapes starting from the trailing dimension: 3 and 2 differ and neither is 1, so the shapes (3,) and (2,) are incompatible.
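Shapes can be checked without building any arrays using `np.broadcast_shapes()`, assuming NumPy 1.20 or newer. The sketch below also shows one possible fix for the failing case, which is to add an axis so every element of `a` is combined with every element of `b`:

```python
import numpy as np

print(np.broadcast_shapes((3, 1), (1, 3)))  # (3, 3)
print(np.broadcast_shapes((4, 3), (3,)))    # (4, 3)

# Incompatible shapes raise ValueError, just like the addition does
try:
    np.broadcast_shapes((3,), (2,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)  # False

# One possible fix: (3, 1) + (2,) -> (3, 2)
a = np.array([1, 2, 3])
b = np.array([1, 2])
outer_sum = a[:, None] + b
print(outer_sum)
# [[2 3]
#  [3 4]
#  [4 5]]
```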

# 6. Memory Layout & Efficiency (C-order vs F-order)

NumPy stores arrays in memory in two major ways, and understanding this helps with:
- Speed
- Compatibility with ML libraries
- Efficient reshaping
- Faster computations

#### 6.1 **C-order (Row-major)**

This is NumPy’s default.
- Data stored row by row
- The last axis changes fastest
- Mostly used in Python, C, scikit-learn, PyTorch

In [41]:
arr = np.array([[1, 2, 3], [4, 5, 6]], order = 'C')
arr

array([[1, 2, 3],
       [4, 5, 6]])

Faster for row-wise iteration

#### 6.2 **F-order (Column-major)**

- Data stored column by column
- The first axis changes fastest
- Compatible with MATLAB, Fortran

In [43]:
arr = np.array([[1, 2, 3], [4, 5, 6]], order = 'F')
arr

array([[1, 2, 3],
       [4, 5, 6]])

Useful for linear algebra & scientific libraries

#### 6.3 **Checking Memory Layout**

In [44]:
arr.flags

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

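For this array, created with order = 'F', F_CONTIGUOUS is True and C_CONTIGUOUS is False. A default (C-order) 2D array of the same shape shows the reverse:
- C_CONTIGUOUS : True
- F_CONTIGUOUS : False

A contiguous 1D array reports both flags as True.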
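The difference between the two layouts is also visible in `strides`, the number of bytes to step along each axis. The sketch below fixes the dtype to `int64` (8 bytes per element) so the stride values are predictable:

```python
import numpy as np

c_arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64, order="C")
f_arr = np.asfortranarray(c_arr)

# C-order: next row is 3 elements away, next column is 1 element away
print(c_arr.strides)  # (24, 8)
# F-order: next row is 1 element away, next column is 2 elements away
print(f_arr.strides)  # (8, 16)

print(c_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])  # True True

# Transposing a C-order array gives an F-contiguous view without copying
print(c_arr.T.flags["F_CONTIGUOUS"])  # True
```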

#### 6.4 **Forcing C-order or F-order**

In [45]:
a_c = np.ascontiguousarray(arr)   # C-order
a_f = np.asfortranarray(arr)      # F-order

Used heavily in ML frameworks that require C-contiguous memory.

#### 6.5 **Why This Matters in ML**

Many ML frameworks expect arrays in a particular memory layout:
- PyTorch → prefers C-order
- TensorFlow → prefers C-order
- Numba/JIT optimization → best performance with C-contiguous arrays
- SciPy linear algebra → sometimes prefers F-order

Example:

In [47]:
X = np.ascontiguousarray(X)

This returns a C-contiguous array, copying only if X is not already C-contiguous.

#### 6.6 **Reshape Efficiency Tip**

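reshape() returns a view (no copy) whenever the new shape can be described by strides over the existing buffer. This is always true for a contiguous array.

Otherwise, it must create a copy, which is slower.

Example:

In [48]:
a = np.arange(1000000)
b = a.reshape(1000, 1000)    # very fast (no copy)

A contiguous slice such as a[100:500] is still contiguous, so reshaping it also returns a view. A copy is needed after a transpose:

In [49]:
c = b.T                      # transposed view, not C-contiguous
d = c.reshape(-1)            # slower (copy required)

Because flattening the transposed view in C-order cannot be expressed as strides over the original buffer.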
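One way to check whether a particular reshape produced a view or a copy is `np.shares_memory()`. A minimal sketch on a small array:

```python
import numpy as np

a = np.arange(12)

b = a.reshape(3, 4)                  # contiguous input: view
from_slice = a[2:10].reshape(2, 4)   # contiguous slice: still a view
from_transpose = b.T.reshape(-1)     # transposed input: copy

print(np.shares_memory(a, b))               # True
print(np.shares_memory(a, from_slice))      # True
print(np.shares_memory(a, from_transpose))  # False
```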

# 7. Vectorization Techniques
Vectorization means performing operations on entire arrays without using Python loops.
This often makes your code 10x to 100x faster on large arrays; the exact gain depends on the operation and the data.

NumPy achieves this because operations run in optimized C code under the hood.

#### 7.1 **Why loops are slow in Python**

Python loops process elements one by one, with heavy overhead.

Example (slow):

In [50]:
arr = np.arange(1000000)
result = []

for x in arr:
    result.append(x * 2)

#### 7.2 **Vectorized replacement (fast)**
- Runs in C
- No per-element Python overhead
- Typically finishes in a few milliseconds for an array of this size

In [51]:
arr = np.arange(1000000)
result = arr * 2
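The size of the speedup depends on the array size, dtype, and machine. A rough way to measure it yourself is `timeit`. The sketch below uses a smaller array than the cells above so it finishes quickly, and the helper names `with_loop` and `vectorized` are illustrative:

```python
import timeit

import numpy as np

arr = np.arange(100_000)

def with_loop():
    # Element-by-element work in the Python interpreter
    return [x * 2 for x in arr]

def vectorized():
    # One call; the loop runs inside NumPy's compiled code
    return arr * 2

loop_time = timeit.timeit(with_loop, number=5)
vec_time = timeit.timeit(vectorized, number=5)

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

Both functions compute the same values, so only the timings differ.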

#### 7.3 **Vectorized Mathematical Operations**
Instead of:

In [56]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

c = np.zeros_like(a)

for i in range(len(a)):
    c[i] = a[i] + b[i]
c

array([11, 22, 33, 44, 55])

Use:

In [58]:
c = a + b
c

array([11, 22, 33, 44, 55])

Works for all operations:

- +, -, *, /
- np.power()
- np.abs()
- np.exp()
- np.log()
- np.sqrt()

#### 7.4 **Vectorized Conditional Operations (Avoiding if/else loops)**

Instead of:

In [59]:
for i in range(len(arr)):
    if arr[i] < 0:
        arr[i] = 0

Use boolean indexing:

In [60]:
arr[arr < 0] = 0

#### 7.5 **Vectorized Functions on Entire Arrays (ufuncs)**

Universal functions apply to every element automatically:
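A few ufuncs applied to a small illustrative array, as a sketch. `np.add.reduce` is included to show that ufuncs also carry methods such as `reduce`:

```python
import numpy as np

# Illustrative array, chosen only for this example
x = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(x)             # element-wise square root
logs = np.log(x)               # element-wise natural log
floored = np.maximum(x, 2.0)   # element-wise maximum against a scalar
total = np.add.reduce(x)       # same result as x.sum()

print(roots)    # [1. 2. 3.]
print(floored)  # [2. 4. 9.]
print(total)    # 14.0
```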

#### 7.6 **Using np.where() for vectorized conditional choices**

Instead of:

In [61]:
result = []
for x in arr:
    result.append(0 if x < 0 else x)

Use:

In [62]:
result = np.where(arr < 0, 0, arr)

#### 7.7 **Broadcasting + Vectorization**

This allows entire dataset transformations:
- No loops
- Works column-wise
- Standard ML normalization
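As a sketch of what this replaces, the example below centers each column of a small made-up dataset twice: once with an explicit loop over columns and once with a single broadcast expression.

```python
import numpy as np

# Made-up dataset: 3 samples, 2 features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Loop version: one column at a time
looped = np.empty_like(X)
for j in range(X.shape[1]):
    looped[:, j] = X[:, j] - X[:, j].mean()

# Broadcast version: the (2,) mean vector is applied to every row
centered = X - X.mean(axis=0)

print(centered)
# [[  -1. -200.]
#  [   0.    0.]
#  [   1.  200.]]
```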

# 8. Normalization Using Broadcasting

Normalization is a preprocessing step used in Machine Learning, Data Analysis, and Deep Learning to scale features.

Broadcasting makes normalization clean and vectorized — no loops needed.

#### 8.1 **Column-wise Normalization (Most Common)**

Normalize each feature (column):

$$
X_{norm} = \frac{X - \text{mean}(X)}{\text{std}(X)}
$$


In [63]:
# Example dataset (3 samples, 3 features)
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
X_norm

array([[-1.22474487, -1.22474487, -1.22474487],
       [ 0.        ,  0.        ,  0.        ],
       [ 1.22474487,  1.22474487,  1.22474487]])

What broadcasting does:
- X.mean(axis=0) → shape (3,)
- Expands to (3, 3)
- Subtracts mean column-wise
- Divides std column-wise

This is used in:
- Linear Regression
- Neural Networks
- Logistic Regression
- PCA

#### 8.2 **Row-wise Normalization (Less Common)**

Normalize each row:

$$
X_{norm} = \frac{X - \text{mean}(X_{row})}{\text{std}(X_{row})}
$$


In [65]:
X_norm = (X - X.mean(axis=1)[:, None]) / X.std(axis=1)[:, None]
X_norm

array([[-1.22474487,  0.        ,  1.22474487],
       [-1.22474487,  0.        ,  1.22474487],
       [-1.22474487,  0.        ,  1.22474487]])

Here:
- [:, None] converts shape (3,) → (3,1)
- Broadcasting makes it work row-wise

#### 8.3 **Min–Max Normalization**

Convert values to range [0, 1]:

$$
X_{minmax} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$


In [66]:
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
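If a column is constant, its range is zero and the division produces `nan`. One possible guard is sketched below; replacing a zero range with 1 (so the constant column maps to 0) is an assumption made for this example, not the only reasonable choice:

```python
import numpy as np

# Made-up dataset whose middle column is constant
X = np.array([[1.0, 5.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 5.0, 9.0]])

col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min

# Assumption: treat a zero range as 1 so constant columns become 0
safe_range = np.where(col_range == 0, 1.0, col_range)
X_minmax = (X - col_min) / safe_range

print(X_minmax)
# [[0.  0.  0. ]
#  [0.5 0.  0.5]
#  [1.  0.  1. ]]
```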

#### 8.4 **Standardization Example with Random Data**

In [67]:
X = np.random.randint(1, 100, size=(5, 4))

X_standard = (X - X.mean(axis=0)) / X.std(axis=0)
X_standard

array([[-0.66968   ,  0.28609297, -0.33527382, -0.45913113],
       [-0.79603472, -0.23046378, -0.5558487 ,  1.04347984],
       [ 0.2463917 , -1.3033124 ,  1.91458995,  1.08521903],
       [ 1.85741434, -0.46887458, -0.95288348, -1.58608936],
       [-0.63809132,  1.7165578 , -0.07058396, -0.08347839]])

# 9. Random Shuffling & Permutation

NumPy’s random module provides tools to shuffle and reorder arrays without writing loops.

These operations are essential in ML for:
- Shuffling datasets
- Creating random train/test splits
- Randomizing rows
- Sampling without replacement

#### 9.1 **np.random.shuffle() — Shuffles the array in place**

This modifies the original array permanently.

Example:

In [68]:
arr = np.array([1, 2, 3, 4, 5])
np.random.shuffle(arr)
arr

array([4, 1, 2, 5, 3])

#### 9.2 **np.random.permutation() — Returns a new shuffled array**

Unlike shuffle(), this does NOT modify the original array.

In [69]:
arr = np.array([1, 2, 3, 4, 5])
np.random.permutation(arr)

array([3, 4, 2, 1, 5])

In [70]:
arr

array([1, 2, 3, 4, 5])

#### 9.3 **Shuffling Rows of a 2D Dataset**

Usually used in ML before training models:

In [71]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
np.random.shuffle(X)
X

array([[5, 6],
       [1, 2],
       [3, 4],
       [7, 8]])

- Rows are randomly reordered
- Columns remain the same

#### 9.4 **Getting Random Index Order**

In [72]:
idx = np.random.permutation(len(X))
X[idx]

array([[3, 4],
       [1, 2],
       [7, 8],
       [5, 6]])

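This gives you:
- A random ordering of the rows (sampling without replacement)
- One index array that can be applied to both X and y, so features and labels stay aligned

Shuffling by a permuted index like this is a common way to shuffle ML datasets.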
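A minimal sketch of applying one permuted index to features and labels together, followed by a simple split. The label array `y`, the seed, and the 3/1 split size are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([10, 30, 50, 70])   # hypothetical labels: 10 * first feature

idx = rng.permutation(len(X))    # one random ordering of the row indices
X_shuf, y_shuf = X[idx], y[idx]  # the same ordering applied to both

# Simple split on the shuffled data: first 3 rows train, last row test
X_train, X_test = X_shuf[:3], X_shuf[3:]
y_train, y_test = y_shuf[:3], y_shuf[3:]

# Each label still matches its row
print(np.array_equal(y_shuf, X_shuf[:, 0] * 10))  # True
```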

#### 9.5 **Random Sampling (choose without replacement)**

In [76]:
np.random.choice(arr, size=3, replace=False)

array([3, 5, 2])
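The functions in this section belong to NumPy's legacy random interface. The NumPy documentation recommends the `Generator` interface, created with `np.random.default_rng()`, for new code. A sketch of the equivalent calls follows; the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # a seed makes results repeatable

arr = np.array([1, 2, 3, 4, 5])

perm = rng.permutation(arr)                      # new shuffled copy
sample = rng.choice(arr, size=3, replace=False)  # 3 distinct elements
rng.shuffle(arr)                                 # shuffles arr in place

# Two generators with the same seed produce the same sequence
same = np.array_equal(np.random.default_rng(42).permutation(5),
                      np.random.default_rng(42).permutation(5))
print(same)  # True
```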

# 10. Summary

### 1. Boolean Indexing (Advanced) 

- Use boolean conditions to filter arrays.
- Multiple conditions use `&`, `|`, and `~`.
- Boolean masks allow complex filtering.
- You can modify values using boolean masks:
  `arr[arr > 10] = 0`
- Very useful for data cleaning and preprocessing.


### 2. Fancy Indexing 

- Select elements using lists or arrays of indices.
- Works for selecting arbitrary rows or columns.
- `A[[0, 2]]` → selects rows 0 and 2.
- `A[:, [0, 2]]` → selects columns 0 and 2.
- Useful for reordering data or selecting random samples.


### 3. Sorting & Searching 

- `np.sort()` → returns sorted copy (row-wise for 2D).
- `np.argsort()` → returns indices that would sort the array.
- `np.where(condition)` → returns positions matching the condition.
- `np.searchsorted()` → finds insertion index for sorted order.


### 4. Unique & Set Operations 

- `np.unique()` → all unique values  
- `return_counts=True` → count frequencies  
- `np.intersect1d(a,b)` → common values  
- `np.union1d(a,b)` → all values (unique)  
- `np.setdiff1d(a,b)` → values in a but not in b  
- `np.setxor1d(a,b)` → values not shared by both  


### 5. Advanced Broadcasting 

- NumPy expands smaller arrays to match larger shapes.
- Column vector `(n,1)` and row vector `(1,n)` broadcast to `(n,n)`.
- Used in ML: `(X - mean) / std`.
- 1D arrays broadcast to 2D or 3D along matching dimensions.
- Scalars broadcast to all elements.
- Fails when two aligned dimensions differ and neither is 1.


### 6. Memory Layout & Efficiency 

- NumPy stores arrays in memory as **C-order (row-major)** or **F-order (column-major)**.
- `C` → faster for row-wise operations (default in NumPy, Python, ML libraries).
- `F` → preferred in Fortran/MATLAB & some SciPy linear algebra.
- Check layout: `arr.flags`
- Convert layout: `np.ascontiguousarray()`, `np.asfortranarray()`
- Contiguous memory = faster reshape and better ML compatibility.

### 7. Vectorization Techniques 

- Replace Python loops with vectorized NumPy operations.
- Array operations (`+`, `-`, `*`, `/`) run in fast C code.
- Boolean indexing is the vectorized replacement for `if` + loops.
- `np.where()` provides vectorized conditional logic.
- Ufuncs (`np.sin`, `np.exp`, `np.log`) operate element-wise automatically.
- Vectorization + broadcasting lead to massive speedups.

### 8. Normalization Using Broadcasting 
- Broadcasting automatically applies normalization over rows/columns.
- Column-wise normalization:
  `X_norm = (X - X.mean(axis=0)) / X.std(axis=0)`
- Row-wise normalization uses:
  `[:, None]` or `np.newaxis`
- Min–Max normalization:
  `(X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))`
- Used in ML preprocessing: Linear Regression, PCA, Neural Networks.


### 9. Random Shuffling & Permutation

- `np.random.shuffle(arr)`  
  → Shuffles array **in place** (modifies original).

- `np.random.permutation(arr)`  
  → Returns a **new shuffled copy** (original unchanged).

- Shuffling 2D arrays shuffles **rows only**.

- `np.random.permutation(len(X))`  
  → Generates random indices; useful for shuffling features & labels together.

- `np.random.choice(..., replace=False)`  
  → Sampling without replacement.
