# Lesson 1.6: Introduction to NumPy

## Introduction
**NumPy**, short for Numerical Python, is a Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on them.

Many computational and data science packages use NumPy as the main building block. It is a fundamental library for scientific computing in Python.

### Key Features of NumPy:
* **ndarray**: An efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* **Vectorization**: Mathematical functions for fast operations on entire arrays of data without having to write loops.
* **Linear Algebra and more**: Tools for linear algebra (matrix manipulation), random number generation, and Fourier transforms.
* **C API**: For connecting NumPy with libraries written in C, C++, or FORTRAN.

### Advantages over Python Lists:
1. **Contiguous Memory**: NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. This allows for significantly faster access and manipulation.
2. **Vectorized Operations**: NumPy algorithms written in C can operate on this memory without per-element type checking or other Python overhead, so whole-array computations run without slow Python `for` loops.

![numpy_vs_list](https://github.com/999crabs-commits/6m-data-1.6-intro-numpy/blob/main/assets/numpy_vs_python_list.png?raw=1)
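
The memory layout described above can be inspected directly. The following is a minimal sketch; the array length and the explicit `int64` dtype are choices made for this illustration, not part of the lesson's data.

```python
import sys
import numpy as np

n = 1_000
arr = np.arange(n, dtype=np.int64)   # explicit dtype so each element is 8 bytes
lst = list(range(n))

# The array keeps its raw 8-byte integers back to back in one buffer.
is_contiguous = bool(arr.flags["C_CONTIGUOUS"])
print("C-contiguous:", is_contiguous)
print("Bytes per element:", arr.itemsize)
print("Array data bytes:", arr.nbytes)      # n * itemsize

# A list holds references to separate int objects. sys.getsizeof counts
# only the list's own reference storage, not the int objects themselves.
print("List container bytes:", sys.getsizeof(lst))
```

The list figure understates its real footprint, because every integer it refers to is a separate Python object stored elsewhere in memory.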

## Part 1: Performance Benchmark
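To give you an idea of the performance difference, consider a NumPy array of one million integers and an equivalent Python list. We use `%timeit`, an IPython/Jupyter magic command, to measure execution time. Exact timings depend on your machine, but the gap is typically an order of magnitude or more.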

In [1]:
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

print("NumPy Vectorized Multiplication (my_arr * 2):")
%timeit my_arr2 = my_arr * 2

print("\nPython List Comprehension ([x * 2 for x in my_list]):")
%timeit my_list2 = [x * 2 for x in my_list]

NumPy Vectorized Multiplication (my_arr * 2):
2.09 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Python List Comprehension ([x * 2 for x in my_list]):
82.8 ms ± 37.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
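
`%timeit` only works inside IPython or Jupyter. Outside a notebook, a similar comparison can be sketched with the standard-library `timeit` module. The smaller array size and the run count of 20 are assumptions chosen to keep this example quick.

```python
import timeit
import numpy as np

my_arr = np.arange(100_000)
my_list = list(range(100_000))
runs = 20

# timeit.timeit returns the total time in seconds for `number` calls.
numpy_s = timeit.timeit(lambda: my_arr * 2, number=runs)
list_s = timeit.timeit(lambda: [x * 2 for x in my_list], number=runs)

print(f"NumPy: {numpy_s / runs * 1e3:.3f} ms per run")
print(f"List:  {list_s / runs * 1e3:.3f} ms per run")

# Both approaches compute the same values.
same = (my_arr * 2).tolist() == [x * 2 for x in my_list]
print("Same result:", same)
```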


## Part 2: The ndarray (N-dimensional array)
The `ndarray` is a fast, flexible container for large datasets. It is a multidimensional array of fixed size with **homogeneous** elements (all elements must be of the same type).

Every array has:
* **shape**: A tuple indicating the size of each dimension.
* **dtype**: An object describing the data type of the array.
* **ndim**: The number of dimensions (axes).

### ndarray illustration
![ndarray](https://github.com/999crabs-commits/6m-data-1.6-intro-numpy/blob/main/assets/numpy_ndarray.png?raw=1)

In [19]:
# [DEMO] Creating arrays from sequences
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(f"Array 2:\n{arr2}")
print(f"Shape: {arr2.shape}, Dtype: {arr2.dtype}, Dimensions: {arr2.ndim}")

Array 2:
[[1 2 3 4]
 [5 6 7 8]]
Shape: (2, 4), Dtype: int64, Dimensions: 2
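
The demo creates `arr1` from a list that mixes integers and a float but never inspects it. A short sketch of what homogeneity means in practice (the variable names here are illustrative):

```python
import numpy as np

# One float in the input is enough to make every element a float.
mixed = np.array([6, 7.5, 8, 0, 1])
print(mixed, mixed.dtype)

# An all-integer input yields an integer dtype (its width is platform dependent).
ints = np.array([1, 2, 3])
print(ints.dtype.kind)          # 'i' means signed integer

# Arithmetic that needs fractions returns a new floating-point array.
shifted = ints + 0.5
print(shifted, shifted.dtype)
```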


### Data Types and Casting
NumPy supports specific numerical types like `int32`, `float64`, etc. You can explicitly convert an array from one `dtype` to another using the `astype` method.

**Note:** If you cast floating-point numbers to an integer `dtype`, the fractional part is truncated toward zero, not rounded: `3.7` becomes `3` and `-1.2` becomes `-1`.

In [None]:
# [DEMO] Casting arrays
arr = np.array([3.7, -1.2, 0.5, 12.9])
print("Original:", arr)
print("Cast to int32:", arr.astype(np.int32))
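
If truncation is not what you want, round or floor before casting. This sketch reuses the demo values; `np.round` and `np.floor` are standard NumPy functions, and the variable names are illustrative.

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5, 12.9])

truncated = arr.astype(np.int32)             # drops the fractional part (toward zero)
rounded = np.round(arr).astype(np.int32)     # round to nearest first, then cast
floored = np.floor(arr).astype(np.int32)     # always round down, then cast

print("astype only:", truncated.tolist())    # 3, -1, 0, 12
print("round first:", rounded.tolist())      # 4, -1, 0, 13
print("floor first:", floored.tolist())      # 3, -2, 0, 12
```

Note that `np.round` rounds exact halves to the nearest even value, which is why `0.5` becomes `0` here.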

### [EXERCISE 1: Creation & Casting]
1. Create a 3x4 array of all ones using `np.ones()`.
2. Cast this array to `float32`.
3. Create an array of strings representing numbers: `['1.25', '-9.6', '42']`. Cast it to `float`.

In [5]:
import numpy as np

# 1. Create a 3x4 array of all ones using np.ones().
ones_array = np.ones((3, 4))
print("3x4 array of ones:\n", ones_array)
print("Dtype:", ones_array.dtype)

# 2. Cast this array to float32.
float32_array = ones_array.astype(np.float32)
print("\nCasted to float32:\n", float32_array)
print("Dtype after casting:", float32_array.dtype)

# 3. Create an array of strings representing numbers: ['1.25', '-9.6', '42']. Cast it to float.
string_array = np.array(['1.25', '-9.6', '42'])
float_from_string_array = string_array.astype(float)
print("\nOriginal string array:", string_array)
print("Casted to float from string array:", float_from_string_array)
print("Dtype after casting:", float_from_string_array.dtype)

3x4 array of ones:
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
Dtype: float64

Casted to float32:
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
Dtype after casting: float32

Original string array: ['1.25' '-9.6' '42']
Casted to float from string array: [ 1.25 -9.6  42.  ]
Dtype after casting: float64


## Part 3: Arithmetic & Broadcasting
Arithmetic operations between arrays of the same shape are applied element-wise to the whole array, with no explicit `for` loops. **Broadcasting** describes how arithmetic works between arrays of different shapes.

![vectorization](https://github.com/999crabs-commits/6m-data-1.6-intro-numpy/blob/main/assets/vectorization.png?raw=1)

Example: A scalar value is broadcast (conceptually replicated, without actually being copied) to match the shape of a larger array.

In [6]:
# [DEMO] Arithmetic & Broadcasting
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print("Element-wise multiplication (arr * arr):\n", arr * arr)
print("\nBroadcasting scalar (1 / arr):\n", 1 / arr)

Element-wise multiplication (arr * arr):
 [[ 1.  4.  9.]
 [16. 25. 36.]]

Broadcasting scalar (1 / arr):
 [[1.         0.5        0.33333333]
 [0.25       0.2        0.16666667]]
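
Broadcasting also applies between two arrays, not only between an array and a scalar. A minimal sketch, where `row` and `col` are made-up values for illustration:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)
col = np.array([[100.], [200.]])               # shape (2, 1)

plus_row = arr + row    # row is added to every row of arr
plus_col = arr + col    # col is added to every column of arr
print(plus_row)
print(plus_col)

# Shapes that cannot be aligned raise an error rather than guessing.
try:
    arr + np.array([1., 2.])     # shape (2,) does not match the trailing axis of 3
except ValueError as exc:
    print("Broadcast error:", exc)
```

Shapes are compared from the trailing axis backwards, and two axes are compatible when they are equal or one of them is 1.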


## Part 4: Indexing and Slicing
Indexing and slicing a one-dimensional array uses the same syntax as a Python list. In 2D arrays, individual elements and sub-arrays are selected with `[row, column]` syntax.

### 2D Array Indexing Syntax
![2d_array_indexing](https://github.com/999crabs-commits/6m-data-1.6-intro-numpy/blob/main/assets/ndarray_axis_index.png?raw=1)

**Important:** Array slices are **views** on the original array. This means data is not copied, and modifications to the slice will be reflected in the source array.

In [8]:
# [DEMO] Slicing views
arr = np.arange(10)
arr_slice = arr[5:8]
arr_slice[1] = 12345
print("Original array modified via slice:", arr)

# [DEMO] 2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nFirst two rows, columns 1 onwards:\n", arr2d[:2, 1:])

Original array modified via slice: [    0     1     2     3     4     5 12345     7     8     9]

First two rows, columns 1 onwards:
 [[2 3]
 [5 6]]
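
When you need an independent piece of an array, copy the slice explicitly. The sketch below contrasts a view with a copy, using `np.shares_memory` to check whether two arrays overlap in memory.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # basic slice: shares memory with arr
copy = arr[5:8].copy()     # explicit copy: independent data

copy[:] = -1               # changes the copy only
print(arr)                 # arr is unchanged

view[:] = 64               # assigning through the view
print(arr)                 # positions 5, 6 and 7 are now 64

print(np.shares_memory(arr, view), np.shares_memory(arr, copy))
```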


### [EXERCISE 2: The Logic of Slicing]
1. Select the first column of `arr2d` using a slice.
2. Set all values in the second row to 0.
3. **Socratic Prompt:** How does `arr2d[1]` differ from `arr2d[1, :]`? (Hint: check shapes)

In [18]:
# Your code here
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # Re-initialize arr2d for demonstration
print("Original arr2d:\n", arr2d)

# 1. Select the first column of arr2d using a slice.
first_column = arr2d[:, 0]
print("\nFirst column of arr2d:\n", first_column)

# 2. Set all values in the second row to 0.
arr2d[1, :] = 0
print("\narr2d after setting second row to 0:\n", arr2d)

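# Socratic Prompt: How does arr2d[1] differ from arr2d[1, :]?
# Both select the second row as a view of shape (3,). Each indexing expression
# creates a new view object, so `is` is False even though both share arr2d's data.
print("\nShape of arr2d[1]:", arr2d[1].shape)
print("Shape of arr2d[1, :]:", arr2d[1, :].shape)
print("Are they the same object (view)?", arr2d[1] is arr2d[1, :])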

Original arr2d:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

First column of arr2d:
 [1 4 7]

arr2d after setting second row to 0:
 [[1 2 3]
 [0 0 0]
 [7 8 9]]

Shape of arr2d[1]: (3,)
Shape of arr2d[1, :]: (3,)
Are they the same object (view)? False
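
The `False` above reflects Python object identity, not the data. A short sketch that separates the two questions, and adds a slice-based selection `arr2d[1:2]` (not part of the exercise) for contrast:

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

a = arr2d[1]        # integer index: drops the row axis, shape (3,)
b = arr2d[1, :]     # the same selection written out in full, shape (3,)
c = arr2d[1:2]      # slice instead of an integer: keeps the axis, shape (1, 3)

print(a.shape, b.shape, c.shape)
print("a is b:", a is b)                          # distinct view objects
print("share memory:", np.shares_memory(a, b))    # same underlying data
print("equal values:", np.array_equal(a, b))
```

So `arr2d[1]` and `arr2d[1, :]` are equivalent selections. The shape only changes when the integer index is replaced by a slice.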


## Part 5: Boolean Indexing
Like arithmetic operations, comparisons (such as `==`) with arrays are vectorized. This yields a boolean array which can be used to filter data.

In [None]:
# [DEMO] Filtering scores
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

bob_mask = (names == 'Bob')
print("Mask:", bob_mask)
print("Bob's scores:\n", scores[bob_mask])

### [EXERCISE 3: Complex Filtering]
1. Select all scores where the name is NOT 'Bob'.
2. Select scores for 'Bob' or 'Will' using the `|` operator.
3. Find all scores less than 80 and set them to 0.

In [13]:
# Your code here
# Re-create the demo data so this cell runs on its own
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

# 1. Select all scores where the name is NOT 'Bob'.
notbob_mask = (names != 'Bob')
print("Mask (Not Bob):", notbob_mask)
print("Not Bob's scores:\n", scores[notbob_mask])

# 2. Select scores for 'Bob' or 'Will' using the | operator.
bob_or_will_mask = (names == 'Bob') | (names == 'Will')
print("\nMask (Bob or Will):", bob_or_will_mask)
print("Bob or Will's scores:\n", scores[bob_or_will_mask])

# 3. Find all scores less than 80 and set them to 0.
print("\nOriginal scores for modification:\n", scores)
scores_less_than_80_mask = scores < 80
scores[scores_less_than_80_mask] = 0
print("Scores after setting values less than 80 to 0:\n", scores)

Mask (Not Bob): [False  True  True False  True  True  True]
Not Bob's scores:
 [[ 85  90]
 [ 95 100]
 [ 85  92]
 [ 95  80]
 [ 72  80]]

Mask (Bob or Will): [ True False  True  True  True False False]
Bob or Will's scores:
 [[ 75  80]
 [ 95 100]
 [100  77]
 [ 85  92]]

Original scores for modification:
 [[ 75  80]
 [ 85  90]
 [ 95 100]
 [100  77]
 [ 85  92]
 [ 95  80]
 [ 72  80]]
Scores after setting values less than 80 to 0:
 [[  0  80]
 [ 85  90]
 [ 95 100]
 [100   0]
 [ 85  92]
 [ 95  80]
 [  0  80]]
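
Two related tools are worth a brief sketch: `~` to invert a mask and `&` to require both conditions. The threshold of 90 and the variable names below are made up for illustration. The example also shows that boolean indexing returns a copy, and uses `np.where` as a non-destructive alternative to in-place assignment.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77],
                   [85, 92], [95, 80], [72, 80]])

# ~ inverts a mask; & keeps rows where both masks are True.
# Parentheses matter because & and | bind more tightly than comparisons.
not_bob = ~(names == 'Bob')
high_first = scores[:, 0] >= 90
selected = scores[not_bob & high_first]
print(selected)

# Boolean indexing returns a copy: editing it leaves scores untouched.
scratch = scores[not_bob]
scratch[:] = 0
print(scores[1])

# np.where builds a new array instead of modifying scores in place.
floored = np.where(scores < 80, 0, scores)
print(floored[0], scores[0])
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the `&` and `|` operators are used instead.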


## Part 6: Universal Functions (ufuncs) and Methods
A **ufunc** is a function that performs element-wise operations on data in ndarrays.

* **Unary ufuncs**: Take one array (e.g., `sqrt`, `exp`).
* **Binary ufuncs**: Take two arrays (e.g., `add`, `maximum`).
* **Statistical Methods**: `mean`, `sum`, `std` can be computed over the entire array or along an axis.
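
The unary and binary ufuncs named in the list can be tried on two small arrays (the values are chosen for illustration):

```python
import numpy as np

a = np.array([1., 4., 9., 16.])
b = np.array([2., 3., 10., 12.])

roots = np.sqrt(a)            # unary: square root of each element
ones = np.exp(np.zeros(3))    # unary: e**0 for each element
sums = np.add(a, b)           # binary: same result as a + b
larger = np.maximum(a, b)     # binary: element-wise larger value

print(roots)
print(ones)
print(sums)
print(larger)
```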

In [14]:
# [DEMO] Statistical Methods
arr = np.random.randn(3, 4)
print("Random Array:\n", arr)
print("\nMean down rows (axis=0):", arr.mean(axis=0))
print("Sum across columns (axis=1):", arr.sum(axis=1))

Random Array:
 [[-1.16455138 -0.53523268 -0.83809406 -1.14051486]
 [-0.2741838  -0.84294596 -1.30623438  0.17642206]
 [-0.89559993  0.36654267  0.61121136  0.08390943]]

Mean down rows (axis=0): [-0.7781117  -0.33721199 -0.51103903 -0.29339446]
Sum across columns (axis=1): [-3.67839298 -2.24694208  0.16606353]
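
The random values above change on every run. For reproducible results, one option is a seeded generator from `np.random.default_rng`, which NumPy's documentation recommends for new code. The seed value 42 is arbitrary. The sketch also shows which axis each reduction collapses.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # arbitrary seed, for repeatability
arr = rng.standard_normal((3, 4))        # plays the role of np.random.randn(3, 4)

col_means = arr.mean(axis=0)   # collapses the 3 rows: one value per column
row_sums = arr.sum(axis=1)     # collapses the 4 columns: one value per row
print(col_means.shape, row_sums.shape)

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts back against the original array.
centered = arr - arr.mean(axis=1, keepdims=True)
print(centered.mean(axis=1))   # each row mean is now approximately 0
```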


## Part 7: Linear Algebra
Linear algebra operations, like matrix multiplication, are crucial for many data science algorithms. Multiplying two arrays with `*` is an element-wise product; for matrix multiplication, use `.dot()` or the `@` operator.

![matrix_multiplication](https://github.com/999crabs-commits/6m-data-1.6-intro-numpy/blob/main/assets/matrix_multiplication.png?raw=1)

In [17]:
# [DEMO] Matrix Multiplication
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])

print("Matrix product (x @ y):\n", x @ y)

Matrix product (x @ y):
 [[ 28  64]
 [ 67 181]]
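
To make the difference between `*` and `@` concrete, this sketch reuses `x` and `y` from the demo and shows how the shapes combine in each case.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
y = np.array([[6, 23], [-1, 7], [8, 9]])    # shape (3, 2)

product = x @ y                  # (2, 3) @ (3, 2) gives (2, 2)
print(product.shape)
print(np.array_equal(product, x.dot(y)))    # .dot() gives the same result here

squares = x * x                  # element-wise: shapes must match or broadcast
print(squares)

# x * y fails because (2, 3) and (3, 2) are not broadcast-compatible.
try:
    x * y
except ValueError as exc:
    print("Element-wise error:", exc)

reverse = y @ x                  # order matters: (3, 2) @ (2, 3) gives (3, 3)
print(reverse.shape)
```

For `@`, the inner dimensions must agree: the number of columns in the left operand has to equal the number of rows in the right operand.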


### [EXERCISE 4: Reshaping & Statistics]
1. Create an array of 15 integers using `arange(15)` and reshape it to `(3, 5)`.
2. Calculate the average value of each row.
3. Use `np.unique()` to find distinct elements in an array of your choice.
4. Transpose the reshaped array using `.T` and check the new shape.

In [16]:
# Your code here
import numpy as np

# 1. Create an array of 15 integers using arange(15) and reshape it to (3, 5).
arr = np.arange(15).reshape((3, 5))
print("Original reshaped array (3x5):\n", arr)

# 2. Calculate the average value of each row.
row_averages = arr.mean(axis=1)
print("\nAverage value of each row:", row_averages)

# 3. Use np.unique() to find distinct elements in an array of your choice.
duplicate_array = np.array([1, 2, 2, 3, 4, 4, 5, 1])
unique_elements = np.unique(duplicate_array)
print("\nOriginal array with duplicates:", duplicate_array)
print("Unique elements:", unique_elements)

# 4. Transpose the reshaped array using .T and check the new shape.
transpose_arr = arr.T
print("\nTransposed array:\n", transpose_arr)
print("Shape of transposed array:", transpose_arr.shape)

Original reshaped array (3x5):
 [[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

Average value of each row: [ 2.  7. 12.]

Original array with duplicates: [1 2 2 3 4 4 5 1]
Unique elements: [1 2 3 4 5]

Transposed array:
 [[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]
Shape of transposed array: (5, 3)
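
A few related details go slightly beyond the exercise. As a hedged sketch: `reshape` accepts `-1` for one dimension that NumPy should infer, `.T` returns a view rather than a copy, and `np.unique` can also count occurrences through its `return_counts` parameter.

```python
import numpy as np

arr = np.arange(15).reshape(3, -1)   # -1 lets NumPy infer the 5
print(arr.shape)

t = arr.T                            # transpose is a view: no data is copied
print(np.shares_memory(arr, t))

t[0, 0] = 99                         # writing through the transpose...
print(arr[0, 0])                     # ...changes the original as well

# return_counts=True also reports how often each distinct value occurs.
values, counts = np.unique(np.array([1, 2, 2, 3, 4, 4, 5, 1]), return_counts=True)
print(values, counts)
```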
