# Lesson 1.6: Introduction to NumPy

## Introduction
**NumPy**, short for Numerical Python, is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Many computational and data science packages use NumPy as the main building block. It is a fundamental library for scientific computing in Python.

### Key Features of NumPy:
* **ndarray**: An efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
* **Vectorization**: Mathematical functions for fast operations on entire arrays of data without having to write loops.
* **Linear Algebra and More**: Tools for linear algebra, random number generation, and Fourier transforms.
* **C API**: For connecting NumPy with libraries written in C, C++, or FORTRAN.

### Advantages over Python Lists:
1. **Contiguous Memory**: NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. This allows for significantly faster access and manipulation.
2. **Vectorized Operations**: NumPy algorithms written in C can operate on this memory without type checking or other Python overhead, performing complex computations without slow `for` loops.

![numpy_vs_list](https://github.com/Clydine/6m-data-1.6-intro-numpy/blob/main/assets/numpy_vs_python_list.png?raw=1)
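The storage difference can be inspected directly. The following is a minimal sketch, assuming 64-bit integers; the names `arr`, `lst`, and `list_container_bytes` are illustrative, and the exact list size depends on the Python build.

```python
import sys
import numpy as np

# One million 64-bit integers. The dtype is set explicitly so the sizes
# below do not depend on the platform's default integer type.
arr = np.arange(1_000_000, dtype=np.int64)
lst = list(range(1_000_000))

# The array stores raw 8-byte values back to back in a single buffer.
print(arr.itemsize)               # bytes per element
print(arr.nbytes)                 # itemsize * number of elements
print(arr.flags["C_CONTIGUOUS"])  # True for a freshly created array

# The list itself only stores references; every int is a separate
# Python object that lives elsewhere in memory.
list_container_bytes = sys.getsizeof(lst)
print(list_container_bytes)
```

Note that `sys.getsizeof(lst)` counts only the list's own reference table, not the integer objects it points to, so the list's true footprint is larger than the number printed.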

## Part 1: Performance Benchmark
To give you an idea of the performance difference, consider a NumPy array of one million integers and an equivalent Python list. We use the `%timeit` magic command (available in IPython and Jupyter) to measure execution time.

In [None]:
import numpy as np
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

print("NumPy Vectorized Multiplication (my_arr * 2):")
%timeit my_arr2 = my_arr * 2

print("\nPython List Comprehension ([x * 2 for x in my_list]):")
%timeit my_list2 = [x * 2 for x in my_list]
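`%timeit` only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The names `n_runs`, `numpy_time`, and `list_time` are illustrative, and the actual timings will vary from machine to machine.

```python
import timeit
import numpy as np

my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

# Run each operation a few times and report the average time per run.
n_runs = 5
numpy_time = timeit.timeit(lambda: my_arr * 2, number=n_runs) / n_runs
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=n_runs) / n_runs

print(f"NumPy array: {numpy_time * 1e3:.3f} ms per run")
print(f"Python list: {list_time * 1e3:.3f} ms per run")
```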

## Part 2: The ndarray (N-dimensional array)
The `ndarray` is a fast, flexible container for large datasets. It is a multidimensional array of fixed size with **homogeneous** elements (all elements must be of the same type).

Every array has:
* **shape**: A tuple indicating the size of each dimension.
* **dtype**: An object describing the data type of the array.
* **ndim**: The number of dimensions (axes).

### ndarray illustration
![ndarray](https://github.com/Clydine/6m-data-1.6-intro-numpy/blob/main/assets/numpy_ndarray.png?raw=1)

In [None]:
# [DEMO] Creating arrays from sequences
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)

print(f"Array 2:\n{arr2}")
print(f"Shape: {arr2.shape}, Dtype: {arr2.dtype}, Dimensions: {arr2.ndim}")
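Because elements must be homogeneous, `np.array` picks one `dtype` that can hold every input value. The short sketch below shows this upcasting; the variable names `mixed` and `as_text` are illustrative.

```python
import numpy as np

# Integers mixed with a float: everything is stored as float64.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64
print(mixed)         # [1.  2.5 3. ]

# A number mixed with a string: everything is stored as text.
as_text = np.array([1, "two"])
print(as_text.dtype.kind)   # 'U' means a Unicode string dtype
print(as_text)              # ['1' 'two']
```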

In [None]:
a = np.arange(15)
print(f"Original array: {a}")
print(f"Shape: {a.shape}, Dtype: {a.dtype}, Dimensions: {a.ndim}")

# 1. Replacing a single element by index
a[0] = 100
print(f"\nArray after replacing the first element: {a}")

# 2. Replacing a slice of elements
a[1:4] = [-1, -2, -3]
print(f"Array after replacing elements in a slice: {a}")

# 3. Replacing elements based on a boolean condition
#    Let's reset 'a' for this example to make it clearer
a = np.arange(15)
mask = a > 10
print(f"\nOriginal array for boolean indexing: {a}")
print(f"Boolean mask (elements > 10): {mask}")
a[mask] = 99
print(f"Array after replacing elements greater than 10 with 99: {a}")

### Data Types and Casting
NumPy supports specific numerical types like `int32`, `float64`, etc. You can explicitly convert an array from one `dtype` to another using the `astype` method.

**Note:** If you cast floating-point numbers to an integer `dtype`, the fractional part is truncated toward zero rather than rounded (for example, `-1.2` becomes `-1` and `12.9` becomes `12`).

In [None]:
# [DEMO] Casting arrays
arr = np.array([3.7, -1.2, 0.5, 12.9])
print("Original:", arr)
print("Casted to int32:", arr.astype(np.int32))
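If truncation is not the behavior you want, round first and cast afterwards. A small sketch comparing the options on the same values (the names `truncated`, `floored`, and `rounded` are illustrative):

```python
import numpy as np

arr = np.array([3.7, -1.2, 0.5, 12.9])

truncated = arr.astype(np.int32)           # drops the fractional part (toward zero)
floored = np.floor(arr).astype(np.int32)   # rounds down toward negative infinity
rounded = np.rint(arr).astype(np.int32)    # rounds to nearest; ties go to the even value

print(truncated)  # [ 3 -1  0 12]
print(floored)    # [ 3 -2  0 12]
print(rounded)    # [ 4 -1  0 13]
```

Note that `np.rint(0.5)` gives `0.0`, not `1.0`, because ties are rounded to the nearest even value.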

### [EXERCISE 1: Creation & Casting]
1. Create a 3x4 array of all ones using `np.ones()`.
2. Cast this array to `float32`.
3. Create an array of strings representing numbers: `['1.25', '-9.6', '42']`. Cast it to `float`.

In [None]:
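# Your code here
# 1. A 3x4 array of ones (float64 by default)
a = np.ones((3, 4))
# 2. Cast it to float32
print(a.astype(np.float32))
# 3. Strings that represent numbers can be cast to float directly
b = ['1.25', '-9.6', '42']
c = np.array(b)
print(c.astype(float))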

In [None]:
a = np.ones((4, 3))

# Every element equals 1.0, so the mask is all True and every value becomes 1.5
mask = a <= 1
a[mask] = 1.5
print(a)
print(f"Shape: {a.shape}, Dtype: {a.dtype}, Dimensions: {a.ndim}")

In [None]:
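# A 3D array: 2 blocks, each with 2 rows and 3 columns
b = np.ones((2, 2, 3))
b = b.astype(np.int32)
# Boolean masks work the same way in any number of dimensions
mask = b <= 1
b[mask] = 0
print(b)
print(f"Shape: {b.shape}, Dtype: {b.dtype}, Dimensions: {b.ndim}")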

## Part 3: Arithmetic & Broadcasting
Arithmetic operations on arrays are applied element-wise as batch operations, with no explicit `for` loops. **Broadcasting** describes how arithmetic works between arrays of different shapes.

![vectorization](https://github.com/Clydine/6m-data-1.6-intro-numpy/blob/main/assets/vectorization.png?raw=1)

Example: A scalar value being replicated (broadcast) to match the shape of a larger array.

In [None]:
# [DEMO] Arithmetic & Broadcasting
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
print("Element-wise multiplication (arr * arr):\n", arr * arr)
print("\nBroadcasting scalar (1 / arr):\n", 1 / arr)
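Broadcasting is not limited to scalars. As an illustrative sketch (the arrays `row` and `col` are made up for this example), a 1D row or a single-column array can also be stretched to match a 2D array:

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)
col = np.array([[100.], [200.]])               # shape (2, 1)

# The row is repeated for each of the two rows of arr.
print(arr + row)
# [[11. 22. 33.]
#  [14. 25. 36.]]

# The column is repeated for each of the three columns of arr.
print(arr + col)
# [[101. 102. 103.]
#  [204. 205. 206.]]
```

Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.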

## Part 4: Indexing and Slicing
One-dimensional arrays act similarly to Python lists. In 2D arrays, indexing can be done with `[row, column]` syntax.

### 2D Array Indexing Syntax
![2d_array_indexing](https://github.com/Clydine/6m-data-1.6-intro-numpy/blob/main/assets/ndarray_axis_index.png?raw=1)

**Important:** Array slices are **views** on the original array. This means data is not copied, and modifications to the slice will be reflected in the source array.

In [None]:
# [DEMO] Slicing views
arr = np.arange(10)
arr_slice = arr[5:8]
arr_slice[1] = 12345
print("Original array modified via slice:", arr)

# [DEMO] 2D Slicing
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nFirst two rows, columns 1 onwards:\n", arr2d[:2, 1:])
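When you need an independent copy instead of a view, call `.copy()` on the slice. A minimal sketch contrasting the two (the names `view` and `copied` are illustrative):

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # shares memory with arr
copied = arr[5:8].copy()  # owns its own data

copied[0] = -1            # does not touch arr
print(arr)                # [0 1 2 3 4 5 6 7 8 9]

view[0] = -1              # writes through to arr
print(arr)                # [ 0  1  2  3  4 -1  6  7  8  9]

print(np.shares_memory(arr, view), np.shares_memory(arr, copied))  # True False
```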

### [EXERCISE 2: The Logic of Slicing]
1. Select the first column of `arr2d` using a slice.
2. Set all values in the second row to 0.
3. **Socratic Prompt:** How does `arr2d[1]` differ from `arr2d[1, :]`? (Hint: check shapes)

In [None]:
# Your code here
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d_slice = arr2d[:, 0]  # all rows, column 0
print(f"First column: {arr2d_slice}")

arr2d[1,:] = 0
print(f"Array after setting second row to 0:\n{arr2d}")

print(f"\narr2d[1] (content): {arr2d[1]}")
print(f"arr2d[1] (shape): {arr2d[1].shape}")
print(f"arr2d[1] (dimension): {arr2d[1].ndim}")

print(f"\narr2d[1, :] (content): {arr2d[1, :]}")
print(f"arr2d[1, :] (shape): {arr2d[1, :].shape}")
print(f"arr2d[1, :] (dimension): {arr2d[1, :].ndim}")

In [None]:
import numpy as np
arr2d_new = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(f"Original array:\n{arr2d_new}\n")

# Using single integer index for the row
row_indexed = arr2d_new[1]
print(f"arr2d_new[1]: {row_indexed}")
print(f"Shape of arr2d_new[1]: {row_indexed.shape}")
print(f"Dimension of arr2d_new[1]: {row_indexed.ndim}\n")

# Using single integer index for row and full slice for columns
row_indexed_slice = arr2d_new[1, :]
print(f"arr2d_new[1, :]: {row_indexed_slice}")
print(f"Shape of arr2d_new[1, :]: {row_indexed_slice.shape}")
print(f"Dimension of arr2d_new[1, :]: {row_indexed_slice.ndim}\n")

# Using slicing for the row to preserve dimension
row_sliced_preserved = arr2d_new[1:2, :]
print(f"arr2d_new[1:2, :]:\n{row_sliced_preserved}")
print(f"Shape of arr2d_new[1:2, :]: {row_sliced_preserved.shape}")
print(f"Dimension of arr2d_new[1:2, :]: {row_sliced_preserved.ndim}")

In [None]:
# Your code here
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d_slice = arr2d[:, 0]  # all rows, column 0
print(arr2d_slice)
arr2d[1,:] = 0
print(arr2d)
print(arr2d[1])
print(arr2d[1,:])

## Part 5: Boolean Indexing
Like arithmetic operations, comparisons (such as `==`) with arrays are vectorized. This yields a boolean array which can be used to filter data.

In [None]:
# [DEMO] Filtering scores
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
scores = np.array([[75, 80], [85, 90], [95, 100], [100, 77], [85, 92], [95, 80], [72, 80]])

bob_mask = (names == 'Bob')
print("Mask:", bob_mask)
print("Bob's scores:\n", scores[bob_mask])
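Boolean masks can be negated with `~` and combined with `&` (and) and `|` (or). The sketch below uses hypothetical city and temperature data, unrelated to the scores example, to show the syntax:

```python
import numpy as np

# Hypothetical data for illustration only.
cities = np.array(["Oslo", "Cairo", "Lima", "Oslo", "Cairo"])
temps = np.array([4, 31, 19, -2, 35])

is_oslo = cities == "Oslo"
is_hot = temps > 30

print(temps[~is_oslo])             # [31 19 35]
print(temps[is_oslo | is_hot])     # [ 4 31 -2 35]
print(temps[~is_oslo & ~is_hot])   # [19]

# Parentheses are required when comparisons are written inline,
# because & and | bind more tightly than comparison operators.
print(cities[(temps > 0) & (temps < 20)])   # ['Oslo' 'Lima']
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the `&` and `|` operators are used instead.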

### [EXERCISE 3: Complex Filtering]
1. Select all scores where the name is NOT 'Bob'.
2. Select scores for 'Bob' or 'Will' using the `|` operator.
3. Find all scores less than 80 and set them to 0.

In [None]:
# Your code here
not_bob = (names != 'Bob')


## Part 6: Universal Functions (ufuncs) and Methods
A **ufunc** is a function that performs element-wise operations on data in ndarrays.

* **Unary ufuncs**: Take one array (e.g., `sqrt`, `exp`).
* **Binary ufuncs**: Take two arrays (e.g., `add`, `maximum`).
* **Statistical Methods**: `mean`, `sum`, `std` can be computed over the entire array or along an axis.

In [None]:
# [DEMO] Statistical Methods
arr = np.random.randn(3, 4)
print("Random Array:\n", arr)
print("\nMean down rows (axis=0):", arr.mean(axis=0))
print("Sum across columns (axis=1):", arr.sum(axis=1))
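For comparison, here is a small deterministic sketch of unary and binary ufuncs, plus the `axis` argument on a fixed array. The arrays `a`, `b`, and `m` are illustrative values chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([1., 4., 9., 16.])
b = np.array([2., 3., 10., 15.])

print(np.sqrt(a))         # unary:  [1. 2. 3. 4.]
print(np.add(a, b))       # binary: [ 3.  7. 19. 31.]
print(np.maximum(a, b))   # binary: [ 2.  4. 10. 16.]

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row).
m = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]
print(m.sum(axis=0))      # [3 5 7]
print(m.sum(axis=1))      # [ 3 12]
```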

daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])

1. Reshape `daily_visits` into a `(6, 5)` array representing 6 weeks × 5 days.
2. Compute total visits per week.
3. Compute average visits per day of the week (e.g., all "day 1 of week" together, all "day 2 of week" together, etc.).
4. Flatten the reshaped array back to 1D and confirm it matches the original `daily_visits`.

In [None]:
daily_visits = np.array([
    120, 130, 125, 140, 150,
    160, 170, 155, 145, 135,
    128, 132, 138, 142, 148,
    152, 158, 162, 168, 172,
    180, 190, 185, 175, 165,
    155, 145, 135, 125, 115
])
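# 6 weeks x 5 days
b = np.reshape(daily_visits, (6, 5))
print(b)
# Total visits per week: sum across the days in each row
c = b.sum(axis=1)
print(c)
# Average visits per day of the week: mean down each column
d = b.mean(axis=0)
print(d)
# Flatten the reshaped array and confirm it matches the original
flat = b.flatten()
print(flat)
print(np.array_equal(flat, daily_visits))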

## Part 7: Linear Algebra
Linear algebra operations, like matrix multiplication, are crucial for many data science algorithms. Multiplying two arrays with `*` is an element-wise product; for matrix multiplication, use `.dot()` or the `@` operator.

![matrix_multiplication](https://github.com/Clydine/6m-data-1.6-intro-numpy/blob/main/assets/matrix_multiplication.png?raw=1)

In [None]:
# [DEMO] Matrix Multiplication
x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[6, 23], [-1, 7], [8, 9]])

print("Matrix product (x @ y):\n", x @ y)
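To make the difference between `*` and `@` concrete, here is a small sketch with two 2×2 arrays (the values are arbitrary and chosen only so the results are easy to verify by hand):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

print(a * b)   # element-wise product
# [[ 5 12]
#  [21 32]]

print(a @ b)   # matrix product
# [[19 22]
#  [43 50]]

# .dot() gives the same result as @ for 2D arrays.
print(np.array_equal(a @ b, a.dot(b)))   # True
```

For the matrix product, the number of columns in the left array must equal the number of rows in the right array, which is why `x` with shape `(2, 3)` can be multiplied by `y` with shape `(3, 2)`.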

This mirrors the tiny matrix multiply example from class, but with more features.

# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])

customers = np.array(["C1", "C2", "C3", "C4", "C5"])

1. Choose a weight vector `w = [w_views, w_time, w_purchases]` (for example `[0.1, 0.5, 1.0]`).
2. Compute a score for each customer using `scores = X @ w`.
3. Rank customers by score (highest first).
4. Change the weights to emphasize `past_purchases` more than the other features, and recompute.
5. In a markdown cell, answer:
   * Which customer is top-ranked before vs after changing weights?
   * In what real-world situation might you prefer each weighting?


In [31]:
import numpy as np
# Each row: [page_views, time_on_site (minutes), past_purchases]
X = np.array([
    [10,  3.5,  0],
    [25,  5.0,  1],
    [40,  2.0,  0],
    [15, 10.0,  3],
    [30,  4.0,  2]
])

customers = np.array(["C1", "C2", "C3", "C4", "C5"])
# Weights for [page_views, time_on_site, past_purchases]
w = np.array([0.1, 0.5, 1.0])
score = X @ w
# Scores are kept as floats so that close values are still ranked correctly
print(f"Scores: {score}")

# Get indices that would sort the score array in descending order
sorted_indices = np.argsort(score)[::-1]
print(f"Sorted indices: {sorted_indices}")

# Rank customers based on the sorted indices
customers_ranked = customers[sorted_indices]
print(f"Customers ranked by score (highest first): {customers_ranked}")

# Now, let's change the weights to emphasize past_purchases more
w_new = [0.05, 0.2, 1.5] # Emphasize past_purchases (1.5) more than views (0.05) and time (0.2)
score_new = X @ w_new
print(f"\nNew scores with changed weights: {score_new}")

sorted_indices_new = np.argsort(score_new)[::-1]
customers_ranked_new = customers[sorted_indices_new]
print(f"Customers ranked with new weights: {customers_ranked_new}")

Scores: [2.75 6.   5.   9.5  7.  ]
Sorted indices: [3 4 1 2 0]
Customers ranked by score (highest first): ['C4' 'C5' 'C2' 'C3' 'C1']

New scores with changed weights: [1.2  3.75 2.4  7.25 5.3 ]
Customers ranked with new weights: ['C4' 'C5' 'C2' 'C3' 'C1']


### [EXERCISE 4: Reshaping & Statistics]
1. Create an array of 15 integers using `arange(15)` and reshape it to `(3, 5)`.
2. Calculate the average value of each row.
3. Use `np.unique()` to find distinct elements in an array of your choice.
4. Transpose the reshaped array using `.T` and check the new shape.

In [None]:
# Your code here
