# NumPy: The Foundation of Scientific Computing in Python

## Introduction

NumPy (Numerical Python) is the fundamental package for scientific computing in Python. It provides:

- A powerful N-dimensional array object
- Sophisticated broadcasting functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities

This tutorial will introduce you to the most essential NumPy concepts and functions that are widely used in machine learning and data science projects.

**Source:** [NumPy Documentation](https://numpy.org/doc/stable/) and [numpy-ml](https://github.com/ddbourgin/numpy-ml)

## Why NumPy?

You can do numerical calculations in pure Python. At first, Python may seem fast enough, but once your data gets large, you'll start to notice slowdowns.

One of the main reasons to use NumPy is that it's **fast**. Behind the scenes, its core routines are implemented in optimized, compiled C code, which runs much faster than the equivalent loops written in pure Python.

The benefit of this being behind the scenes is you don't need to know any C to take advantage of it. You can write your numerical computations in Python using NumPy and get the added speed benefits.

### Key Advantages of NumPy:

1. **Performance**: NumPy arrays are stored in contiguous memory blocks, making operations much faster than with Python lists
2. **Vectorization**: Perform operations on entire arrays without explicit loops
3. **Broadcasting**: Efficiently perform operations between arrays of different shapes
4. **Memory Efficiency**: NumPy arrays use less memory than Python lists for the same data
5. **Integration**: Seamlessly works with other scientific Python libraries like SciPy, Pandas, and scikit-learn
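
The memory-efficiency point above can be checked with a rough sketch like the one below. It assumes CPython, where `sys.getsizeof` reports the size of the list container and of each integer object separately; the exact byte counts are implementation details and will vary between Python versions.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both the
# list container and the int objects it references.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# An array stores the raw values in one buffer: n elements * 8 bytes each.
array_bytes = np_array.nbytes

print(f"Python list: ~{list_bytes} bytes")
print(f"NumPy array: {array_bytes} bytes")
```

The array figure excludes the small fixed overhead of the array object itself, which does not grow with the number of elements.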

In [None]:
# Import NumPy
import numpy as np
import time

# Let's demonstrate the speed difference between NumPy and Python lists
size = 1000000

# Python list
start_time = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
python_list = list(range(size))
python_list = [x * 2 for x in python_list]
python_time = time.perf_counter() - start_time

# NumPy array
start_time = time.perf_counter()
numpy_array = np.arange(size)
numpy_array = numpy_array * 2
numpy_time = time.perf_counter() - start_time

print(f"Python list operation time: {python_time:.6f} seconds")
print(f"NumPy array operation time: {numpy_time:.6f} seconds")
print(f"NumPy is {python_time/numpy_time:.1f}x faster")

## 1. NumPy Arrays: The Building Blocks

NumPy arrays are the core data structure in NumPy. They are similar to Python lists but with added functionality and performance benefits.

### Concept: N-dimensional Arrays

NumPy arrays can be one-dimensional (vectors), two-dimensional (matrices), or higher-dimensional. This flexibility allows NumPy to represent complex data structures efficiently.

- **1D array**: A vector with a single axis
- **2D array**: A matrix with rows and columns
- **3D array**: A cube with depth, rows, and columns
- **N-D array**: An array with N axes

### 1.1 Creating NumPy Arrays

In [None]:
import numpy as np

# From Python lists
array_1d = np.array([1, 2, 3, 4, 5])  # 1D array (vector)
array_2d = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array (matrix)
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # 3D array

print("1D array:")
print(array_1d)
print("Shape:", array_1d.shape)  # (5,) - 5 elements in 1 dimension
print("Dimensions:", array_1d.ndim)  # 1 dimension
print()

print("2D array:")
print(array_2d)
print("Shape:", array_2d.shape)  # (2, 3) - 2 rows, 3 columns
print("Dimensions:", array_2d.ndim)  # 2 dimensions
print()

print("3D array:")
print(array_3d)
print("Shape:", array_3d.shape)  # (2, 2, 2) - 2 blocks, 2 rows, 2 columns
print("Dimensions:", array_3d.ndim)  # 3 dimensions

### Concept: Array Creation Functions

NumPy provides various functions to create arrays with specific patterns or properties. These functions are essential for initializing data structures efficiently without manually specifying each element.
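
For random arrays, NumPy also offers the newer `Generator` interface, created with `np.random.default_rng`. The sketch below is an alternative to the legacy `np.random.*` functions, not a replacement for the cell that follows; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# Seeded generator: the same seed reproduces the same draws
rng = np.random.default_rng(seed=42)

uniform = rng.random((2, 3))            # floats in [0, 1)
normal = rng.normal(0.0, 1.0, (2, 3))   # mean 0, standard deviation 1
integers = rng.integers(0, 10, (3, 3))  # integers from 0 to 9 (upper bound is exclusive)

print(uniform.shape, normal.shape, integers.shape)
```

Passing the generator object around explicitly avoids relying on hidden global random state.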

In [None]:
# Creating arrays with specific values
zeros = np.zeros((3, 4))  # 3x4 array of zeros
ones = np.ones((2, 3, 4))  # 2x3x4 array of ones
full = np.full((2, 2), 7)  # 2x2 array filled with 7

# Creating sequences
arange = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1

# Random arrays
random_uniform = np.random.random((2, 3))  # 2x3 array of random values between 0 and 1
random_normal = np.random.normal(0, 1, (2, 3))  # 2x3 array from normal distribution (mean=0, std=1)
random_integers = np.random.randint(0, 10, (3, 3))  # 3x3 array of random integers from 0 to 9

# Identity matrix
identity = np.eye(3)  # 3x3 identity matrix

print("Zeros array:")
print(zeros)
print()

print("Sequence with arange:")
print(arange)
print()

print("Evenly spaced values with linspace:")
print(linspace)
print()

print("Random integers:")
print(random_integers)
print()

print("Identity matrix:")
print(identity)

### Concept: Array Attributes and Data Types

NumPy arrays have several attributes that provide information about their structure. Understanding these attributes is crucial for working with arrays effectively.

- **shape**: The dimensions of the array (tuple of integers)
- **ndim**: Number of dimensions (axes)
- **size**: Total number of elements
- **dtype**: Data type of the elements
- **itemsize**: Size in bytes of each element

NumPy supports various data types, allowing for memory optimization and precise numerical operations.
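
As a small illustration of the memory trade-off, the sketch below converts a `float64` array to `float32` with `astype`. The array contents are arbitrary example values.

```python
import numpy as np

values = np.linspace(0.0, 1.0, 1000)      # float64 by default
values_32 = values.astype(np.float32)     # astype returns a converted copy

print(values.dtype, values.nbytes)        # float64 8000
print(values_32.dtype, values_32.nbytes)  # float32 4000

# The trade-off: float32 keeps roughly 7 significant decimal digits
max_error = np.max(np.abs(values - values_32))
print(f"Largest rounding error: {max_error:.2e}")
```

Halving the storage costs some precision, so the right dtype depends on how much rounding error the computation can tolerate.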

In [None]:
# Create an array with a specific data type
int_array = np.array([1, 2, 3], dtype=np.int32)
float_array = np.array([1, 2, 3], dtype=np.float64)

# Examine array attributes
array = np.random.randint(0, 10, (3, 4))

print("Array:")
print(array)
print("Shape:", array.shape)  # (3, 4)
print("Dimensions:", array.ndim)  # 2
print("Size (total elements):", array.size)  # 12
print("Data type:", array.dtype)  # typically int64 (can be int32 on some platforms, e.g. Windows)
print("Item size (bytes):", array.itemsize)  # 8 bytes for int64
print("Total memory used (bytes):", array.nbytes)  # 96 bytes for int64 (12 elements * 8 bytes)

# Compare memory usage of different data types
print("\nMemory usage comparison:")
print(f"int32 array itemsize: {int_array.itemsize} bytes")
print(f"float64 array itemsize: {float_array.itemsize} bytes")

### 1.2 Array Indexing and Slicing

### Concept: Array Indexing

NumPy arrays can be indexed similarly to Python lists, but with extended capabilities for multi-dimensional arrays. Understanding indexing is essential for accessing and manipulating specific elements or subsets of data.
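
A few of these extended capabilities are sketched below: combining boolean conditions, assigning through a mask, and negative indices. The `scores` array is made-up example data.

```python
import numpy as np

scores = np.array([[55, 72, 91], [38, 64, 100]])

# Combine conditions with & (and) or | (or); the parentheses are required
mask = (scores >= 60) & (scores < 100)
print(scores[mask])  # [72 91 64]

# Assigning through a mask modifies the array in place
capped = scores.copy()
capped[capped < 60] = 60
print(capped)

# Negative indices count from the end
print(scores[-1, -1])  # 100
```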

In [None]:
# Create a 2D array for demonstration
arr_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("Original 2D array:")
print(arr_2d)
print()

# Indexing single elements
print("Element at row 0, column 1:", arr_2d[0, 1])  # 2
print("Element at row 2, column 3:", arr_2d[2, 3])  # 12
print()

# Slicing rows and columns
print("First row:")
print(arr_2d[0])  # [1, 2, 3, 4]
print()

print("First column:")
print(arr_2d[:, 0])  # [1, 5, 9]
print()

print("Rows 0-1, Columns 1-3:")
print(arr_2d[0:2, 1:4])  # [[2, 3, 4], [6, 7, 8]]
print()

# Advanced indexing
print("Specific elements using index arrays:")
rows = np.array([0, 2])
cols = np.array([1, 3])
print(arr_2d[rows, cols])  # [2, 12] - elements at (0,1) and (2,3)
print()

print("Boolean indexing - elements greater than 5:")
print(arr_2d[arr_2d > 5])  # [6, 7, 8, 9, 10, 11, 12]

### Concept: Views vs. Copies

When slicing NumPy arrays, it's important to understand the difference between views and copies. This concept is crucial for memory efficiency and avoiding unexpected behavior when modifying arrays.
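
One way to tell whether an indexing result is a view or a copy is `np.shares_memory`. The sketch below assumes a plain 1D array and shows that basic slicing gives a view, while integer-array and boolean indexing give copies.

```python
import numpy as np

original = np.arange(10)

basic_slice = original[2:6]        # basic slicing -> view
fancy = original[[2, 3, 4, 5]]     # integer-array indexing -> copy
masked = original[original > 5]    # boolean indexing -> copy

for name, arr in [("slice", basic_slice), ("fancy", fancy), ("masked", masked)]:
    print(name, "shares memory with original:", np.shares_memory(original, arr))

# A view keeps a reference to the array that owns the data
print(basic_slice.base is original)  # True
```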

In [None]:
# Create an array
original = np.array([1, 2, 3, 4, 5])

# Create a view (shares the same data)
view = original[1:4]  # [2, 3, 4]
print("Original array:", original)
print("View:", view)

# Modify the view
view[0] = 10
print("\nAfter modifying view:")
print("Original array:", original)  # [1, 10, 3, 4, 5] - original is modified
print("View:", view)  # [10, 3, 4]

# Create a copy (independent data)
original = np.array([1, 2, 3, 4, 5])  # Reset original
copy = original[1:4].copy()  # [2, 3, 4]
print("\nOriginal array:", original)
print("Copy:", copy)

# Modify the copy
copy[0] = 10
print("\nAfter modifying copy:")
print("Original array:", original)  # [1, 2, 3, 4, 5] - original is unchanged
print("Copy:", copy)  # [10, 3, 4]

### 1.3 Array Reshaping and Manipulation

### Concept: Reshaping Arrays

Reshaping allows you to change the dimensions of an array without changing its data. This is particularly useful when preparing data for machine learning algorithms that expect specific input shapes.
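
Two details are worth seeing in code: the total element count must be preserved, and for a contiguous array `reshape` returns a view rather than a copy. The following is a minimal sketch of both, along with the difference between `flatten` and `ravel`.

```python
import numpy as np

arr = np.arange(12)

# The total number of elements must stay the same: 12 = 3 * 4
grid = arr.reshape(3, 4)

# For a contiguous array like this one, reshape returns a view
grid[0, 0] = 99
print(arr[0])  # 99

# An incompatible shape raises ValueError: 12 elements cannot be split into 5 rows
try:
    arr.reshape(5, -1)
except ValueError as exc:
    print("ValueError:", exc)

# flatten always copies; ravel returns a view when it can
flat_copy = grid.flatten()
flat_view = grid.ravel()
print(np.shares_memory(grid, flat_copy), np.shares_memory(grid, flat_view))  # False True
```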

In [None]:
# Create a 1D array
arr = np.arange(12)  # [0, 1, 2, ..., 11]
print("Original 1D array:")
print(arr)
print("Shape:", arr.shape)  # (12,)
print()

# Reshape to 2D array (3 rows, 4 columns)
arr_2d = arr.reshape(3, 4)
print("Reshaped to 3x4:")
print(arr_2d)
print("Shape:", arr_2d.shape)  # (3, 4)
print()

# Reshape to 2D array (4 rows, 3 columns)
arr_2d_alt = arr.reshape(4, 3)
print("Reshaped to 4x3:")
print(arr_2d_alt)
print("Shape:", arr_2d_alt.shape)  # (4, 3)
print()

# Reshape to 3D array (2 blocks, 2 rows, 3 columns)
arr_3d = arr.reshape(2, 2, 3)
print("Reshaped to 2x2x3:")
print(arr_3d)
print("Shape:", arr_3d.shape)  # (2, 2, 3)
print()

# Using -1 to automatically calculate one dimension
arr_auto = arr.reshape(3, -1)  # 3 rows, columns calculated automatically
print("Reshaped with automatic column calculation:")
print(arr_auto)
print("Shape:", arr_auto.shape)  # (3, 4)

### Concept: Array Manipulation

NumPy provides various functions to manipulate arrays, such as concatenating, splitting, and transposing. These operations are essential for data preprocessing and feature engineering in machine learning.
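
The stacking helpers are closely related to the more general `np.concatenate`, which takes an explicit `axis`. The sketch below checks that relationship for 2D inputs and contrasts it with `np.stack`, which adds a new axis.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 joins along rows, matching vstack for 2D inputs
rows_joined = np.concatenate((a, b), axis=0)   # shape (4, 2)
# axis=1 joins along columns, matching hstack for 2D inputs
cols_joined = np.concatenate((a, b), axis=1)   # shape (2, 4)

print(np.array_equal(rows_joined, np.vstack((a, b))))  # True
print(np.array_equal(cols_joined, np.hstack((a, b))))  # True

# np.stack joins along a new axis instead of an existing one
stacked = np.stack((a, b))                      # shape (2, 2, 2)
print(stacked.shape)
```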

In [None]:
# Create sample arrays
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

print("Array a:")
print(a)
print("\nArray b:")
print(b)
print()

# Concatenation
vertical_stack = np.vstack((a, b))  # Stack vertically (rows)
horizontal_stack = np.hstack((a, b))  # Stack horizontally (columns)

print("Vertical stack (vstack):")
print(vertical_stack)
print("Shape:", vertical_stack.shape)  # (4, 2)
print()

print("Horizontal stack (hstack):")
print(horizontal_stack)
print("Shape:", horizontal_stack.shape)  # (2, 4)
print()

# Splitting
arr = np.arange(16).reshape(4, 4)
print("Original array for splitting:")
print(arr)
print()

# Split vertically (into groups of rows)
row_split = np.vsplit(arr, 2)  # Split into 2 arrays along rows
print("Split by rows (vsplit):")
print("First part:")
print(row_split[0])
print("Second part:")
print(row_split[1])
print()

# Split horizontally (into groups of columns)
col_split = np.hsplit(arr, 2)  # Split into 2 arrays along columns
print("Split by columns (hsplit):")
print("First part:")
print(col_split[0])
print("Second part:")
print(col_split[1])
print()

# Transpose
arr = np.arange(6).reshape(2, 3)
print("Original array:")
print(arr)
print("Shape:", arr.shape)  # (2, 3)
print()

transposed = arr.T  # Transpose rows and columns
print("Transposed array:")
print(transposed)
print("Shape:", transposed.shape)  # (3, 2)

## 2. NumPy Mathematical Operations

### Concept: Vectorization

Vectorization is the process of performing operations on entire arrays without using explicit loops. This is one of the key features that makes NumPy so powerful and efficient. Vectorized operations are much faster than their loop-based counterparts.
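
Vectorization also covers conditional logic, not only arithmetic. As a small sketch with made-up temperature data, `np.where` replaces an `if`/`else` inside a loop:

```python
import numpy as np

temps_c = np.array([-5.0, 12.5, 30.0, 18.0])

# Loop version: one Python-level operation per element
labels_loop = []
for t in temps_c:
    labels_loop.append("cold" if t < 15 else "warm")

# Vectorized version: the comparison and the selection run over the whole array
labels_vec = np.where(temps_c < 15, "cold", "warm")

# Arithmetic works the same way, with no explicit loop
temps_f = temps_c * 9 / 5 + 32

print(labels_loop)
print(labels_vec.tolist())
print(temps_f)
```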

In [None]:
# Create arrays
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

# Element-wise operations
print("Element-wise addition:", a + b)  # [6, 8, 10, 12]
print("Element-wise subtraction:", a - b)  # [-4, -4, -4, -4]
print("Element-wise multiplication:", a * b)  # [5, 12, 21, 32]
print("Element-wise division:", a / b)  # [0.2, 0.33, 0.43, 0.5]
print("Element-wise power:", a ** 2)  # [1, 4, 9, 16]
print()

# Comparison with Python loops
def python_multiply(list1, list2):
    result = []
    for i in range(len(list1)):
        result.append(list1[i] * list2[i])
    return result

# Time comparison for larger arrays
size = 1000000
large_a = np.random.random(size)
large_b = np.random.random(size)
large_list_a = large_a.tolist()
large_list_b = large_b.tolist()

# NumPy vectorized operation
start_time = time.time()
numpy_result = large_a * large_b
numpy_time = time.time() - start_time
print(f"NumPy vectorized multiplication time: {numpy_time:.6f} seconds")

# Python loop (only for small subset to avoid long execution)
subset_size = 10000  # Using smaller subset for loop demonstration
start_time = time.time()
python_result = python_multiply(large_list_a[:subset_size], large_list_b[:subset_size])
python_time = time.time() - start_time
print(f"Python loop multiplication time (for {subset_size} elements): {python_time:.6f} seconds")
print(f"Estimated time for full array: {python_time * (size/subset_size):.2f} seconds")

### Concept: Broadcasting

Broadcasting allows NumPy to work with arrays of different shapes when performing arithmetic operations. It's a powerful feature that simplifies code and makes it more efficient by avoiding unnecessary copies of data.
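
The compatibility rule can be explored directly. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available, and also shows what happens when two shapes cannot be broadcast.

```python
import numpy as np

# Shapes are compared from the trailing axis backwards; two sizes are
# compatible when they are equal or one of them is 1.
print(np.broadcast_shapes((2, 3), (3,)))     # (2, 3)
print(np.broadcast_shapes((2, 3), (2, 1)))   # (2, 3)
print(np.broadcast_shapes((4, 1), (1, 5)))   # (4, 5)

# (2, 3) and (2,) are incompatible: the trailing sizes 3 and 2 differ
a = np.ones((2, 3))
b = np.array([10, 20])
try:
    a + b
except ValueError as exc:
    print("ValueError:", exc)

# Adding an axis turns b into a (2, 1) column, which does broadcast
result = a + b[:, np.newaxis]
print(result)
```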

In [None]:
# Broadcasting examples
a = np.array([[1, 2, 3], [4, 5, 6]])  # 2x3 array
print("Array a (2x3):")
print(a)
print()

# Broadcasting scalar
print("Adding scalar 10 to each element:")
print(a + 10)  # Adds 10 to each element
print()

# Broadcasting 1D array to 2D array
b = np.array([10, 20, 30])  # 1D array with 3 elements
print("Array b (1D with 3 elements):")
print(b)
print()

print("Adding b to each row of a:")
print(a + b)  # b is broadcast to shape (2, 3) and added to a
print()

# Broadcasting with column vector
c = np.array([[100], [200]])  # 2x1 array (column vector)
print("Array c (column vector 2x1):")
print(c)
print()

print("Adding c to each column of a:")
print(a + c)  # c is broadcast to shape (2, 3) and added to a
print()

# Visualizing how broadcasting works
print("Broadcasting visualization:")
print("Original shapes:")
print(f"a: {a.shape} (2x3)")
print(f"b: {b.shape} (3,)")
print(f"c: {c.shape} (2x1)")
print()
print("During broadcasting:")
print("b is treated as:")
print(np.tile(b, (2, 1)))  # Same values as the broadcast b (broadcasting itself does not copy the data)
print()
print("c is treated as:")
print(np.tile(c, (1, 3)))  # Same values as the broadcast c (broadcasting itself does not copy the data)

### Concept: Aggregation Functions

NumPy provides various functions to perform aggregation operations on arrays, such as sum, mean, min, max, etc. These functions are essential for data analysis and feature engineering in machine learning.
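
Aggregations along an axis drop that axis by default. When the result needs to be combined with the original array, the `keepdims=True` argument is often convenient, as in this sketch that centers each row of a small example array:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

row_means = arr.mean(axis=1)                      # shape (2,)
row_means_kept = arr.mean(axis=1, keepdims=True)  # shape (2, 1)

# The (2, 1) result broadcasts against the (2, 3) array row by row
centered = arr - row_means_kept
print(centered)
print(centered.mean(axis=1))  # each row now has mean 0
```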

In [None]:
# Create a 2D array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Array:")
print(arr)
print()

# Basic aggregation functions
print("Sum of all elements:", np.sum(arr))  # 45
print("Mean of all elements:", np.mean(arr))  # 5.0
print("Minimum value:", np.min(arr))  # 1
print("Maximum value:", np.max(arr))  # 9
print("Standard deviation:", np.std(arr))  # ~2.58
print("Variance:", np.var(arr))  # ~6.67
print()

# Aggregation along axes
print("Sum over axis=0 (down the rows, one value per column):")
print(np.sum(arr, axis=0))  # [12, 15, 18] - sum of each column
print()

print("Sum over axis=1 (across the columns, one value per row):")
print(np.sum(arr, axis=1))  # [6, 15, 24] - sum of each row
print()

print("Mean over axis=0 (one value per column):")
print(np.mean(arr, axis=0))  # [4., 5., 6.] - mean of each column
print()

print("Mean over axis=1 (one value per row):")
print(np.mean(arr, axis=1))  # [2., 5., 8.] - mean of each row
print()

# Cumulative functions
print("Cumulative sum over axis=0 (running total down each column):")
print(np.cumsum(arr, axis=0))
print()

print("Cumulative sum over axis=1 (running total across each row):")
print(np.cumsum(arr, axis=1))

### Concept: Universal Functions (ufuncs)

Universal functions (ufuncs) are functions that operate element-wise on arrays. They are optimized for performance and are a key component of NumPy's computational capabilities.
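
Ufuncs are objects of type `np.ufunc` and carry methods of their own. The sketch below shows `reduce`, `accumulate`, and `outer` on `np.add` and `np.multiply`, plus the `out` argument for writing into an existing array.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(isinstance(np.add, np.ufunc))   # True
print(np.add.reduce(arr))             # 10, same as np.sum(arr)
print(np.add.accumulate(arr))         # [ 1  3  6 10], same as np.cumsum(arr)

table = np.multiply.outer(arr, arr)   # 4x4 multiplication table
print(table)

# The out argument writes the result into an existing array
result = np.empty(4, dtype=np.float64)
np.sqrt(arr, out=result)
print(result)
```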

In [None]:
# Create an array
arr = np.array([0, np.pi/4, np.pi/2, np.pi])
print("Array (in radians):")
print(arr)
print()

# Trigonometric functions
print("Sine:")
print(np.sin(arr))  # [0., 0.7071, 1., ~1.2e-16] - sin(pi) is not exactly 0 in floating point
print()

print("Cosine:")
print(np.cos(arr))  # [1., 0.7071, ~6.1e-17, -1.] - cos(pi/2) is not exactly 0 in floating point
print()

print("Tangent:")
print(np.tan(arr))  # [0., ~1., ~1.6e16, ~-1.2e-16] - very large rather than inf, since pi/2 is not exactly representable
print()

# Exponential and logarithmic functions
arr = np.array([1, 2, 3, 4])
print("Array:")
print(arr)
print()

print("Exponential (e^x):")
print(np.exp(arr))  # [2.7183, 7.3891, 20.0855, 54.5982]
print()

print("Natural logarithm (ln):")
print(np.log(arr))  # [0., 0.6931, 1.0986, 1.3863]
print()

print("Base-10 logarithm:")
print(np.log10(arr))  # [0., 0.301, 0.4771, 0.6021]
print()

print("Square root:")
print(np.sqrt(arr))  # [1., 1.4142, 1.7321, 2.]
print()

# Rounding functions
arr = np.array([1.49, 1.51, 2.49, 2.51, -1.49, -1.51])
print("Array:")
print(arr)
print()

print("Round to nearest integer:")
print(np.round(arr))  # [1., 2., 2., 3., -1., -2.]
print()

print("Ceiling (round up):")
print(np.ceil(arr))  # [2., 2., 3., 3., -1., -1.]
print()

print("Floor (round down):")
print(np.floor(arr))  # [1., 1., 2., 2., -2., -2.]

## 3. Real-world Applications of NumPy in Machine Learning

NumPy is the foundation for many machine learning algorithms and data processing tasks. Let's explore some real-world applications to see how NumPy is used in practice.

### 3.1 Linear Regression from Scratch

Linear regression is a fundamental machine learning algorithm that models the relationship between a dependent variable and one or more independent variables. Let's implement it using NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(42)  # For reproducibility
X = 2 * np.random.rand(100, 1)  # 100 random inputs between 0 and 2
y = 4 + 3 * X + np.random.randn(100, 1)  # y = 4 + 3X + noise

# Add bias term (intercept)
X_b = np.c_[np.ones((100, 1)), X]  # Add x0 = 1 to each instance

# Closed-form solution using Normal Equation: θ = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

print("Estimated parameters:")
print(f"Intercept: {theta_best[0][0]:.4f}")
print(f"Slope: {theta_best[1][0]:.4f}")
print("True parameters: Intercept = 4, Slope = 3")

# Make predictions
X_new = np.array([[0], [2]])  # Min and max X values
X_new_b = np.c_[np.ones((2, 1)), X_new]  # Add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Training data')
plt.plot(X_new, y_predict, 'r-', linewidth=2, label='Linear regression model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression from Scratch using NumPy')
plt.legend()
plt.grid(True)
plt.show()

# Implement gradient descent
def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000):
    m = len(X)  # Number of instances
    n = X.shape[1]  # Number of features (including bias)
    theta = np.random.randn(n, 1)  # Random initialization
    
    # Store theta history and cost history for visualization
    theta_history = [theta.copy()]
    cost_history = []
    
    for iteration in range(n_iterations):
        # Compute predictions
        y_pred = X.dot(theta)
        
        # Compute error
        error = y_pred - y
        
        # Compute gradient
        gradients = (2/m) * X.T.dot(error)
        
        # Update parameters
        theta = theta - learning_rate * gradients
        
        # Store theta and cost
        theta_history.append(theta.copy())
        cost = np.mean(error ** 2)  # Mean Squared Error
        cost_history.append(cost)
    
    return theta, theta_history, cost_history

# Run gradient descent
theta_gd, theta_history, cost_history = gradient_descent(X_b, y, learning_rate=0.1, n_iterations=1000)

print("\nGradient Descent results:")
print(f"Intercept: {theta_gd[0][0]:.4f}")
print(f"Slope: {theta_gd[1][0]:.4f}")

# Plot cost history
plt.figure(figsize=(10, 6))
plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost History during Gradient Descent')
plt.grid(True)
plt.show()

### Concept: Matrix Operations in Linear Regression

In the linear regression example above, we used several NumPy matrix operations:

1. **Matrix Multiplication** (`dot`): Used to compute predictions (X·θ) and gradients (X^T·error)
2. **Matrix Transpose** (`T`): Used to compute X^T for the normal equation and gradient calculation
3. **Matrix Inverse** (`linalg.inv`): Used to compute (X^T·X)^(-1) in the normal equation
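4. **Element-wise arithmetic and scalar broadcasting**: Used when subtracting targets from predictions (`y_pred - y`), squaring the errors (`error ** 2`), and scaling the gradient by the learning rate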

These operations are fundamental to many machine learning algorithms and demonstrate the power of NumPy for numerical computing.
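
Forming the inverse explicitly is fine for a small, well-conditioned problem, but a least-squares solver is usually the more numerically robust choice. The sketch below is an alternative, not the method used in the example above: it generates its own synthetic data with `np.random.default_rng` (seed 0 is an arbitrary assumption) and compares the normal equation with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]   # add the bias column

# Normal equation with an explicit inverse
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver: same minimizer without forming the inverse
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

print(theta_inv.ravel())
print(theta_lstsq.ravel())
print(np.allclose(theta_inv, theta_lstsq))  # True
```

Both routes minimize the same squared error, so they agree here; the difference shows up when `X_b.T @ X_b` is close to singular.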

### 3.2 Principal Component Analysis (PCA)

Principal Component Analysis is a dimensionality reduction technique widely used in machine learning. Let's implement it using NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the data (mean=0, std=1)
X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Compute the covariance matrix
cov_matrix = np.cov(X_std.T)
print("Covariance matrix shape:", cov_matrix.shape)  # (4, 4) for 4 features

# Compute eigenvalues and eigenvectors
# (eigh is intended for symmetric matrices such as a covariance matrix and returns real values)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenvectors by decreasing eigenvalues
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Select the top 2 eigenvectors (principal components)
W = eigenvectors[:, :2]

# Transform the data to the new subspace
X_pca = X_std.dot(W)

# Plot the results
plt.figure(figsize=(10, 8))
for i, target_name in enumerate(['Setosa', 'Versicolor', 'Virginica']):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], alpha=0.8, label=target_name)

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset (implemented with NumPy)')
plt.legend()
plt.grid(True)
plt.show()

# Explained variance ratio
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)
print("Explained variance ratio:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"Principal Component {i+1}: {ratio:.4f} ({ratio*100:.2f}%)")

# Cumulative explained variance
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
print("\nCumulative explained variance:")
for i, ratio in enumerate(cumulative_variance_ratio):
    print(f"First {i+1} components: {ratio:.4f} ({ratio*100:.2f}%)")

# Plot explained variance
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(eigenvalues) + 1), explained_variance_ratio, alpha=0.8, align='center',
        label='Individual explained variance')
plt.step(range(1, len(eigenvalues) + 1), cumulative_variance_ratio, where='mid',
         label='Cumulative explained variance')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

### Concept: Eigendecomposition in PCA

In the PCA example, we used NumPy's linear algebra capabilities to perform eigendecomposition of the covariance matrix. This is a fundamental operation in many machine learning algorithms, including PCA, spectral clustering, and factor analysis.

The key steps in PCA are:

1. **Standardize the data**: Center the data by subtracting the mean and scale by dividing by the standard deviation
2. **Compute the covariance matrix**: Measure how features vary together
3. **Compute eigenvalues and eigenvectors**: Find the principal directions in the data
4. **Sort eigenvectors by eigenvalues**: Rank components by importance
5. **Project data onto principal components**: Transform data to the new coordinate system

NumPy's linear algebra module (`np.linalg`) provides efficient implementations of these operations, making it possible to implement complex algorithms like PCA with just a few lines of code.
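
The same component variances can also be obtained from a singular value decomposition of the centered data, which avoids forming the covariance matrix. The sketch below is an illustration on synthetic stand-in data (not the Iris dataset; the mixing matrix and seed are arbitrary assumptions) and checks that the two routes agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Route 1: eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered.T))
order = np.argsort(eigenvalues)[::-1]   # eigh returns ascending order
eigenvalues = eigenvalues[order]

# Route 2: SVD of the centered data; singular values come back in descending order
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
variances_from_svd = S ** 2 / (n_samples - 1)

print(eigenvalues)
print(variances_from_svd)
print(np.allclose(eigenvalues, variances_from_svd))  # True
```

The rows of `Vt` play the role of the eigenvectors, although individual directions may differ in sign between the two routes.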

### 3.3 K-means Clustering

K-means is a popular unsupervised learning algorithm for clustering data. Let's implement it from scratch using NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate synthetic data
np.random.seed(42)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Implement k-means clustering from scratch using NumPy:
def kmeans(X, k, max_iters=100, tol=1e-4):
    # Number of samples and features
    n_samples, n_features = X.shape
    
    # Randomly initialize k centroids
    idx = np.random.choice(n_samples, k, replace=False)
    centroids = X[idx, :]
    
    # Initialize cluster assignments
    prev_centroids = np.zeros((k, n_features))
    clusters = np.zeros(n_samples, dtype=int)  # Integer cluster labels
    
    # Store centroids history for visualization
    centroids_history = [centroids.copy()]
    
    # Iterate until convergence or max iterations
    for _ in range(max_iters):
        # Assign each sample to the closest centroid
        for i in range(n_samples):
            # Calculate distance to each centroid
            distances = np.sqrt(np.sum((X[i] - centroids) ** 2, axis=1))
            # Assign to the closest centroid
            clusters[i] = np.argmin(distances)
        
        # Store previous centroids
        prev_centroids = centroids.copy()
        
        # Update centroids
        for j in range(k):
            # Get all points assigned to this cluster
            cluster_points = X[clusters == j]
            # Update centroid if cluster is not empty
            if len(cluster_points) > 0:
                centroids[j] = np.mean(cluster_points, axis=0)
        
        # Store current centroids
        centroids_history.append(centroids.copy())
        
        # Check for convergence
        if np.all(np.abs(centroids - prev_centroids) < tol):
            break
    
    return clusters, centroids, centroids_history

# Run k-means
k = 4
clusters, centroids, centroids_history = kmeans(X, k)

# Plot the results
plt.figure(figsize=(12, 6))

# Plot original data with true labels
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7, s=40)
plt.title('Original Data with True Labels')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)

# Plot clustered data
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.7, s=40)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.title('K-means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()

# Visualize centroid movement during iterations
plt.figure(figsize=(12, 8))
colors = ['r', 'g', 'b', 'purple']

# Plot data points
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', alpha=0.5, s=40)

# Plot centroid paths
for i in range(k):
    # Extract centroid positions for this cluster across all iterations
    centroid_path = np.array([step_centroids[i] for step_centroids in centroids_history])
    
    # Plot the path
    plt.plot(centroid_path[:, 0], centroid_path[:, 1], c=colors[i], marker='o', 
             markersize=8, linewidth=2, alpha=0.7, label=f'Centroid {i+1} path')
    
    # Mark the final position
    plt.scatter(centroid_path[-1, 0], centroid_path[-1, 1], c=colors[i], marker='X', s=200, edgecolor='black')

plt.title('Centroid Movement during K-means Iterations')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
plt.show()

### Concept: Distance Calculations in K-means

In the K-means implementation, we used NumPy's vectorized operations to efficiently calculate distances between points and centroids. The Euclidean distance between a point `x` and a centroid `c` is calculated as:

$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

In NumPy, this is implemented as:
```python
distances = np.sqrt(np.sum((X[i] - centroids) ** 2, axis=1))
```

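This vectorized operation calculates the distance from a single point to all centroids at once, which is much more efficient than looping over the centroids in Python. The implementation above still loops over the samples; broadcasting can remove that loop as well. This is a common pattern in NumPy-based machine learning implementations.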
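
The per-sample loop can itself be replaced by broadcasting over an extra axis. The following is a minimal sketch on small random data (the array sizes and seed are arbitrary assumptions) that computes all point-to-centroid distances in one expression and checks the result against the per-sample version.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 2))          # 6 points, 2 features
centroids = rng.random((3, 2))  # 3 centroids

# (6, 1, 2) - (1, 3, 2) broadcasts to (6, 3, 2): every point minus every centroid
diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
distances = np.sqrt(np.sum(diff ** 2, axis=2))   # shape (6, 3)
labels = np.argmin(distances, axis=1)            # nearest centroid per point

# Check against the per-sample computation
loop_labels = np.array([
    np.argmin(np.sqrt(np.sum((x - centroids) ** 2, axis=1))) for x in X
])
print(distances.shape)
print(np.array_equal(labels, loop_labels))  # True
```

The intermediate `diff` array has shape `(n_samples, k, n_features)`, so this approach trades extra memory for speed.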

## Practice Problems

Now that you've learned the fundamentals of NumPy, try solving these practice problems to test your understanding.

### Problem 1: Array Manipulation

Create a 5x5 array of random integers between 1 and 100. Then:
1. Extract the central 3x3 subarray
2. Replace the corners of the original array with zeros
3. Calculate the sum of each row and each column

In [None]:
# Example solution
import numpy as np

# Create a 5x5 array of random integers between 1 and 100
np.random.seed(42)  # For reproducibility
arr = np.random.randint(1, 101, size=(5, 5))
print("Original array:")
print(arr)
print()

# 1. Extract the central 3x3 subarray
central = arr[1:4, 1:4]
print("Central 3x3 subarray:")
print(central)
print()

# 2. Replace the corners of the original array with zeros
arr_corners = arr.copy()
arr_corners[0, 0] = 0  # Top-left
arr_corners[0, -1] = 0  # Top-right
arr_corners[-1, 0] = 0  # Bottom-left
arr_corners[-1, -1] = 0  # Bottom-right
print("Array with corners replaced by zeros:")
print(arr_corners)
print()

# 3. Calculate the sum of each row and each column
row_sums = np.sum(arr, axis=1)
col_sums = np.sum(arr, axis=0)
print("Sum of each row:")
print(row_sums)
print("\nSum of each column:")
print(col_sums)

### Problem 2: Broadcasting and Vectorization

1. Create a 4x3 array of random numbers
2. Normalize each row so that it sums to 1 (hint: use broadcasting)
3. Calculate the Euclidean distance between each row and the first row

In [None]:
# Example solution
import numpy as np

# 1. Create a 4x3 array of random numbers
np.random.seed(42)  # For reproducibility
arr = np.random.rand(4, 3)
print("Original array:")
print(arr)
print()

# 2. Normalize each row so that it sums to 1
row_sums = np.sum(arr, axis=1)
normalized = arr / row_sums[:, np.newaxis]  # Broadcasting
print("Normalized array (each row sums to 1):")
print(normalized)
print("\nVerify row sums:")
print(np.sum(normalized, axis=1))  # Should be all 1's
print()

# 3. Calculate the Euclidean distance between each row and the first row
first_row = normalized[0]
distances = np.sqrt(np.sum((normalized - first_row) ** 2, axis=1))
print("Euclidean distances from first row:")
print(distances)

### Problem 3: Implementing a Simple Neural Network Layer

Implement a simple neural network forward pass for a single layer with the following steps:
1. Create a weight matrix of shape (3, 4) with random values
2. Create a bias vector of shape (3,) with random values
3. Create an input matrix of shape (5, 4) with random values
4. Compute the output using the formula: output = input @ weights.T + bias
5. Apply a ReLU activation function: relu(x) = max(0, x)

In [None]:
# Example solution
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# 1. Create a weight matrix of shape (3, 4) with random values
weights = np.random.randn(3, 4)
print("Weights shape:", weights.shape)
print("Weights:")
print(weights)
print()

# 2. Create a bias vector of shape (3,) with random values
bias = np.random.randn(3)
print("Bias shape:", bias.shape)
print("Bias:", bias)
print()

# 3. Create an input matrix of shape (5, 4) with random values
inputs = np.random.randn(5, 4)
print("Inputs shape:", inputs.shape)
print("Inputs:")
print(inputs)
print()

# 4. Compute the output using the formula: output = input @ weights.T + bias
# Note: @ is the matrix multiplication operator in Python 3.5+
linear_output = inputs @ weights.T + bias
print("Linear output shape:", linear_output.shape)
print("Linear output:")
print(linear_output)
print()

# 5. Apply a ReLU activation function: relu(x) = max(0, x)
relu_output = np.maximum(0, linear_output)
print("ReLU output shape:", relu_output.shape)
print("ReLU output:")
print(relu_output)

## Additional Resources

To further enhance your NumPy skills, check out these resources:

- [NumPy Documentation](https://numpy.org/doc/stable/)
- [NumPy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
- [NumPy Tutorials](https://numpy.org/numpy-tutorials/)
- [NumPy ML Implementations](https://github.com/ddbourgin/numpy-ml)
- [From Python to NumPy](https://www.labri.fr/perso/nrougier/from-python-to-numpy/)