## NumPy

### DATA 601

**Syed Tauhid Ullah Shah ([Syed.Tauhidullahshah@ucalgary.ca](mailto:Syed.Tauhidullahshah@ucalgary.ca))**

Further Reading:

* **Python for Data Analysis** (third edition), by _Wes McKinney_ (Chapter 4). [Available online](https://wesmckinney.com/book/numpy-basics.html)
* **[NumPy User Guide](https://numpy.org/devdocs/user/index.html)**




## Table of Contents
1. [Introduction to NumPy](#introduction)
2. [Installation](#installation)
3. [Basic Array Creation](#array-creation)
4. [Array Attributes and Data Types](#attributes-dtypes)
5. [Array Operations](#operations)
6. [Universal Functions (ufuncs)](#ufuncs)
7. [Indexing, Slicing, and Iterating](#indexing)
8. [Shape Manipulation](#shape)
9. [Combining and Splitting Arrays](#combining-splitting)
10. [Copy vs. View](#copy-view)
11. [Broadcasting](#broadcasting)
12. [Linear Algebra with NumPy](#linear-algebra)
13. [Advanced Linear Algebra Functions](#advanced-linalg)
14. [Random Module and Random Numbers](#random)
15. [Integration with Other Libraries](#integration)
16. [Conclusion and Further Resources](#conclusion)
- [Exercise: Working with text data](#exercise)

## 1. Introduction to NumPy <a name="introduction"></a>

**NumPy** is a foundational library for numerical computing in Python. It provides a high-performance multidimensional array object (`ndarray`) and tools to perform operations on these arrays efficiently.

**Key Points:**
- **`ndarray`:** A fast, flexible container for large, homogeneous (single-dtype) datasets in Python.
- **Vectorized Operations:** Perform operations without explicit loops, leveraging optimized C code for speed.
- **Wide Range of Functions:** Includes mathematical, logical, statistical, and linear algebra functions.
- **Foundation of the Scientific Python Ecosystem:** Underpins libraries like pandas, SciPy, and Matplotlib.


## 2. Installation <a name="installation"></a>

Install NumPy with `pip` or `conda` before first use:
```bash
pip install numpy
# or for conda environments
conda install numpy
```

After installation, import the library, conventionally aliased as `np`:



In [4]:
import numpy as np

print(np.__version__)

1.26.4


## 3. Basic Array Creation <a name="array-creation"></a>

### Creating Arrays from Lists
NumPy arrays can be created from Python lists. This section demonstrates how to convert lists into `ndarray` objects:

In [6]:
# Create a 1D array from a list of numbers
arr1 = np.array([1, 2, 3, 4, 5])
print("1D array:", arr1)

# Create a 2D array (matrix) from nested lists
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
print("2D array:\n", arr2)

1D array: [1 2 3 4 5]
2D array:
 [[1 2 3]
 [4 5 6]]


### Special Arrays
NumPy provides functions to create arrays with special values:

In [7]:
# An array filled with zeros, shape 2x3
zeros = np.zeros((2, 3))
print("Zeros array:\n", zeros)

# An array filled with ones, shape 2x3
ones = np.ones((2, 3))
print("Ones array:\n", ones)

# Identity matrix 3x3
eye = np.eye(3)
print("Identity matrix:\n", eye)

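# Empty array: memory is allocated but not initialized, so the values are arbitrary
# (they may happen to print as zeros, as in the output below)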
empty = np.empty((2, 2))
print("Empty array:\n", empty)

# Array of evenly spaced values: start, stop, step
arange = np.arange(0, 10, 2)
print("Arange array:", arange)

# Array of linearly spaced values between 0 and 1
linspace = np.linspace(0, 1, 5)
print("Linspace array:", linspace)

Zeros array:
 [[0. 0. 0.]
 [0. 0. 0.]]
Ones array:
 [[1. 1. 1.]
 [1. 1. 1.]]
Identity matrix:
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Empty array:
 [[0. 0.]
 [0. 0.]]
Arange array: [0 2 4 6 8]
Linspace array: [0.   0.25 0.5  0.75 1.  ]
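
A few related constructors complement the ones above. The following is a minimal sketch (the variable names are illustrative only) showing `np.full`, `np.zeros_like`, and the difference in endpoint handling between `np.linspace` and `np.arange`:

```python
import numpy as np

# np.full creates an array of a given shape filled with a single value
sevens = np.full((2, 3), 7)

# zeros_like / ones_like reuse the shape and dtype of an existing array
template = np.array([[1.5, 2.5], [3.5, 4.5]])
zeros_like_template = np.zeros_like(template)

# linspace includes the stop value by default; arange excludes it
with_stop = np.linspace(0, 1, 5)
without_stop = np.arange(0, 1, 0.25)

print(sevens)
print(zeros_like_template.shape, zeros_like_template.dtype)
print(with_stop[-1], without_stop[-1])  # 1.0 0.75
```

When the number of points matters more than the step size, `np.linspace` is usually the safer choice, because floating-point steps in `np.arange` can make the length hard to predict.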


### Random Arrays
NumPy’s random module can create arrays with random values:

In [8]:
# Random floats in [0.0, 1.0)
rand = np.random.rand(2, 3)
print("Random array:\n", rand)

# Random integers between 0 and 9
randint = np.random.randint(0, 10, (2, 3))
print("Random integers:\n", randint)

Random array:
 [[0.03470396 0.24392306 0.36835776]
 [0.7249157  0.39326828 0.21139587]]
Random integers:
 [[3 6 5]
 [9 6 8]]


## 4. Array Attributes and Data Types <a name="attributes-dtypes"></a>

Every NumPy array has attributes describing its structure and data:

In [9]:
a = np.array([[1,2,3], [4,5,6]])

print("Number of dimensions (ndim):", a.ndim)     # Outputs: 2
print("Shape of array:", a.shape)                 # Outputs: (2, 3)
print("Total number of elements (size):", a.size) # Outputs: 6
print("Data type of elements (dtype):", a.dtype)  # e.g., int64 (the default integer type is platform-dependent)
print("Size of each element in bytes (itemsize):", a.itemsize)
print("Total bytes consumed by array (nbytes):", a.nbytes)

Number of dimensions (ndim): 2
Shape of array: (2, 3)
Total number of elements (size): 6
Data type of elements (dtype): int64
Size of each element in bytes (itemsize): 8
Total bytes consumed by array (nbytes): 48


### Data Types (dtypes)
You can explicitly specify data types when creating arrays.

- A number of scalar data types (`dtype`) are available.


- `int8`, `uint8`: 8-bit signed and unsigned integers. <br>
  Also `int16`, `uint16`, `int32`, `uint32`, `int64`, and `uint64`.


- `float32`: Single-precision floating point. <br>
  Also `float16` (half-precision) and `float64` (double-precision). <br>
  `float64` is compatible with Python's `float`.


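- And there are others:<br>
  `complex64`, `complex128`, and `complex256` (only on platforms that support it) <br>
  `bool`, `object`, `bytes_`, `str_` (the aliases `string_` and `unicode_` were removed in NumPy 2.0)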
  
  

In [11]:
# Create an array of float64 numbers
arr_float = np.array([1, 2, 3], dtype=np.float64)
print("Array dtype:", arr_float.dtype)

# Convert the array to int32
arr_int = arr_float.astype(np.int32)
print("Converted to int32:", arr_int, arr_int.dtype)

Array dtype: float64
Converted to int32: [1 2 3] int32
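
Two details of dtype conversion are easy to overlook: casting floats to integers discards the fractional part, and each dtype has fixed limits and a fixed memory cost. A small sketch with made-up values:

```python
import numpy as np

# Casting float -> int with astype truncates toward zero (no rounding)
values = np.array([1.9, -1.9, 2.5])
truncated = values.astype(np.int32)
print(truncated)  # values 1, -1, 2

# np.iinfo / np.finfo report the limits of integer and floating-point dtypes
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)  # -128 127
print(np.finfo(np.float32).eps)

# A narrower dtype uses less memory for the same number of elements
big = np.zeros(1000, dtype=np.float64)
small = big.astype(np.float32)
print(big.nbytes, small.nbytes)  # 8000 4000
```

If rounding is what you want, apply `np.round` (or `np.rint`) before calling `astype`.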


## 5. Array Operations <a name="operations"></a>

NumPy arrays allow element-wise arithmetic and support a suite of aggregate and mathematical operations.

### Element-wise Arithmetic

In [13]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Addition, subtraction, multiplication, division, exponentiation
print("a + b:", a + b)     # [5 7 9]
print("a - b:", a - b)     # [-3 -3 -3]
print("a * b:", a * b)     # [ 4 10 18]
print("a / b:", a / b)     # [0.25 0.4  0.5]
print("a ** 2:", a ** 2)   # [1 4 9]

a + b: [5 7 9]
a - b: [-3 -3 -3]
a * b: [ 4 10 18]
a / b: [0.25 0.4  0.5 ]
a ** 2: [1 4 9]


### Scalar Operations
Operations with scalars apply to each element:

In [14]:
print("a + 5:", a + 5)      # [6 7 8]
print("a * 2:", a * 2)      # [2 4 6]
print("sin(a):", np.sin(a)) # [0.84147098 0.90929743 0.14112001]

a + 5: [6 7 8]
a * 2: [2 4 6]
sin(a): [0.84147098 0.90929743 0.14112001]


### Aggregate Functions
Aggregate functions compute summaries across axes or the entire array:

In [15]:
c = np.array([[1, 2, 3], [4, 5, 6]])

print("Total sum:", c.sum())            # 21
print("Sum along columns:", c.sum(axis=0))  # [5 7 9]
print("Sum along rows:", c.sum(axis=1))     # [ 6 15]
print("Mean:", c.mean())                # 3.5
print("Minimum:", c.min())              # 1
print("Maximum:", c.max())              # 6

Total sum: 21
Sum along columns: [5 7 9]
Sum along rows: [ 6 15]
Mean: 3.5
Minimum: 1
Maximum: 6
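
Two options that often accompany axis-wise aggregation are `keepdims=True` and the `arg*` functions. The sketch below reuses the same 2x3 array; `row_fractions` and `col_of_row_max` are illustrative names:

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6]])

# keepdims=True keeps the reduced axis with length 1,
# so the result lines up with c in later arithmetic
row_sums = c.sum(axis=1, keepdims=True)   # shape (2, 1)
row_fractions = c / row_sums              # each row now sums to 1

# argmax returns the position of the maximum rather than its value
col_of_row_max = c.argmax(axis=1)

print(row_sums.shape)
print(row_fractions.sum(axis=1))
print(col_of_row_max)
```

Without `keepdims=True`, `c.sum(axis=1)` has shape `(2,)`, and dividing the `(2, 3)` array by it would fail because the shapes do not line up.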


### Element-wise Comparisons and Logical Operations
You can perform comparisons and logical operations that return boolean arrays:

In [18]:
d = np.array([1, 2, 3, 4, 5])
print("Elements greater than 3:", d > 3)  

print(d[d > 3])  # [4 5]

# Logical AND of two conditions: True where 2 < d < 5 (both bounds exclusive)
print("Elements between 2 and 5:", np.logical_and(d > 2, d < 5))
# [False False True True False]

Elements greater than 3: [False False False  True  True]
[4 5]
Elements between 2 and 5: [False False  True  True False]
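
The operators `&`, `|`, and `~` are a common shorthand for `np.logical_and`, `np.logical_or`, and `np.logical_not` on boolean arrays. A short sketch, which also shows `np.where` and the `any`/`all` reductions:

```python
import numpy as np

d = np.array([1, 2, 3, 4, 5])

# & (and), | (or), ~ (not) work element-wise on boolean arrays.
# The parentheses are required because & binds more tightly than comparisons.
between = (d > 2) & (d < 5)
outside = ~between

# np.where picks from two alternatives based on a condition
labels = np.where(d % 2 == 0, "even", "odd")

# any / all reduce a boolean array to a single truth value
all_positive = (d > 0).all()
any_above_four = (d > 4).any()

print(between, outside)
print(labels)
print(all_positive, any_above_four)
```

Python's own `and`/`or` keywords do not work here: they try to treat the whole array as a single truth value, which raises a `ValueError` for arrays with more than one element.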


## 6. Universal Functions (ufuncs) <a name="ufuncs"></a>

Universal functions (ufuncs) are NumPy functions that operate element-wise on arrays.

In [19]:
a = np.array([0, np.pi/2, np.pi])
print("sin(a):", np.sin(a))     # approximately [0. 1. 0.]; sin(pi) comes out as ~1.22e-16 due to floating-point rounding
print("exp(a):", np.exp(a))     # [ 1.          4.81047738 23.14069263]
print("log(a+1):", np.log(a + 1))

sin(a): [0.0000000e+00 1.0000000e+00 1.2246468e-16]
exp(a): [ 1.          4.81047738 23.14069263]
log(a+1): [0.         0.94421571 1.42108041]
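
Ufuncs are not limited to one input. The sketch below (with made-up values) shows a binary ufunc, the `reduce` and `accumulate` methods that binary ufuncs such as `np.add` provide, and a tolerance-based comparison for results like `sin(pi)`:

```python
import numpy as np

x = np.array([1.0, 5.0, 3.0])
y = np.array([4.0, 2.0, 6.0])

# Binary ufuncs combine two arrays element-wise
elementwise_max = np.maximum(x, y)       # larger value at each position

# np.add.reduce collapses the array with repeated addition (same result as x.sum())
total = np.add.reduce(x)

# np.add.accumulate keeps the intermediate results (a running total)
running = np.add.accumulate(x)

# Floating-point results are rarely exactly zero, so compare with a tolerance
sin_pi_is_zero = np.isclose(np.sin(np.pi), 0.0)

print(elementwise_max, total, running, sin_pi_is_zero)
```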


## 7. Indexing, Slicing, and Iterating <a name="indexing"></a>

Understanding how to extract and manipulate parts of arrays is crucial.

### Basic Indexing and Slicing

In [20]:
a = np.array([10, 20, 30, 40, 50])
print("First element:", a[0])
print("Slice 1:4:", a[1:4])
print("Reverse array:", a[::-1])

First element: 10
Slice 1:4: [20 30 40]
Reverse array: [50 40 30 20 10]


For multi-dimensional arrays:

In [21]:
b = np.array([[1,2,3],[4,5,6],[7,8,9]])
print("Element at row 0, col 1:", b[0,1])
print("Second row:", b[1,:])
print("Third column:", b[:,2])
print("Subarray (rows 1 to end, cols 1 to end):\n", b[1:, 1:])

Element at row 0, col 1: 2
Second row: [4 5 6]
Third column: [3 6 9]
Subarray (rows 1 to end, cols 1 to end):
 [[5 6]
 [8 9]]


### Boolean and Fancy Indexing
Boolean indexing filters elements based on conditions, while fancy indexing uses arrays of indices.

In [22]:
arr = np.array([1,2,3,4,5])
# Boolean indexing: select elements > 3
filtered = arr[arr > 3]
print("Elements > 3:", filtered)

# Fancy indexing: select by a list of indices
indices = [0, 2, 4]
selected = arr[indices]
print("Selected elements:", selected)

Elements > 3: [4 5]
Selected elements: [1 3 5]
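
The same ideas extend to 2D arrays and to assignment. In this sketch (`grid`, `picked`, and `clipped` are illustrative names), two index lists are paired up element by element, and a boolean mask is used on the left-hand side of an assignment:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

# Two index lists are paired up: positions (0, 1), (1, 2), (2, 3)
picked = grid[[0, 1, 2], [1, 2, 3]]      # values 1, 6, 11

# A boolean mask can select the elements to overwrite
clipped = grid.copy()
clipped[clipped > 8] = 8

# Fancy indexing returns a copy, so changing the result leaves grid untouched
picked[0] = -1
print(picked)
print(clipped)
print(grid[0, 1])                        # still 1
```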


### Iterating Over Arrays
While vectorized operations are preferred, sometimes iteration is necessary.

In [24]:
# Iterating over rows in a 2D array
for row in b:
    print("Row:", row)

# Iterating over every element with nditer, which visits the elements
# one at a time regardless of the number of dimensions
for element in np.nditer(b):
    print(element, end=' ')
print()  # newline after iteration

Row: [1 2 3]
Row: [4 5 6]
Row: [7 8 9]
1 2 3 4 5 6 7 8 9 


## 8. Shape Manipulation <a name="shape"></a>

Reshaping, transposing, and flattening arrays are common tasks.

### Reshaping Arrays

In [25]:
a = np.arange(12)    # Array with elements 0 to 11
b = a.reshape((3, 4))  
print("Reshaped array (3x4):\n", b)

Reshaped array (3x4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
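
`reshape` can infer one dimension for you, and it refuses shapes that do not match the number of elements. A minimal sketch:

```python
import numpy as np

a = np.arange(12)

# -1 asks NumPy to infer that dimension from the array's size
two_rows = a.reshape(2, -1)      # shape (2, 6)
three_d = a.reshape(2, 3, -1)    # shape (2, 3, 2)
print(two_rows.shape, three_d.shape)

# The total number of elements must stay the same; otherwise reshape raises ValueError
reshape_failed = False
try:
    a.reshape(5, -1)             # 12 elements cannot fill rows of a 5-row array
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```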


### Transpose and Swapaxes

![image.png](attachment:image.png)

https://www.andreaminini.net/math/transpose-of-a-matrix

![image-2.png](attachment:image-2.png)

In [26]:
print("Transpose of b:\n", b.T)

# For multi-dimensional arrays, use swapaxes
c = np.random.rand(2, 3, 4)
# Swap axes 1 and 2
c_swapped = c.swapaxes(1, 2)
print("Shape after swapaxes:", c_swapped.shape)

Transpose of b:
 [[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]]
Shape after swapaxes: (2, 4, 3)


### Flattening 

In [28]:
flat = b.flatten()    # Returns a new, contiguous flattened array
print("Flattened:", flat)



Flattened: [ 0  1  2  3  4  5  6  7  8  9 10 11]
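
`ravel` is a close relative of `flatten`. The difference matters for memory use: `flatten` always copies, while `ravel` returns a view whenever the memory layout allows it. A sketch that checks this with `np.shares_memory`:

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

flat_copy = b.flatten()   # always a copy
flat_view = b.ravel()     # a view here, because b is contiguous in memory

print(np.shares_memory(b, flat_copy))  # False
print(np.shares_memory(b, flat_view))  # True

# Writing through the view changes b; writing to the copy does not
flat_view[0] = 100
flat_copy[1] = -1
print(b[0, 0], b[0, 1])                # 100 1
```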


### Adding New Axes
Expand array dimensions for broadcasting or matching shapes:

In [30]:
x = np.array([1,2,3])
print("Original shape:", x.shape)
x_new = x[:, np.newaxis]  # Equivalent to x.reshape(3,1)
print("New shape:", x_new.shape)

Original shape: (3,)
New shape: (3, 1)


## 9. Combining and Splitting Arrays <a name="combining-splitting"></a>

### Concatenation and Stacking

In [32]:
a = np.array([1,2,3])
b = np.array([4,5,6])

# Concatenating 1D arrays
combined = np.concatenate((a,b))
print("Concatenated:", combined)

# For multi-dimensional arrays, specify axis
A = np.array([[1,2],[3,4]])
B = np.array([[5,6]])
concat_axis0 = np.concatenate((A,B), axis=0)
print("Concatenated along axis0:\n", concat_axis0)



Concatenated: [1 2 3 4 5 6]
Concatenated along axis0:
 [[1 2]
 [3 4]
 [5 6]]


In [34]:


a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = np.vstack((a, b))
print(result)


[[1 2 3]
 [4 5 6]]


In [35]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = np.hstack((a, b))
print(result)


[1 2 3 4 5 6]


In [36]:
a = np.array([[1], [2], [3]])
b = np.array([[4], [5], [6]])

result = np.hstack((a, b))
print(result)


[[1 4]
 [2 5]
 [3 6]]


### Splitting Arrays

In [37]:
arr = np.arange(10)
# Split into 2 equal parts
split_arrays = np.array_split(arr, 2)
print("Split arrays:", split_arrays)

# For multi-dimensional arrays, vertical/horizontal splits
mat = np.arange(16).reshape(4,4)
upper, lower = np.vsplit(mat, 2)
print("Upper half:\n", upper)
print("Lower half:\n", lower)

Split arrays: [array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9])]
Upper half:
 [[0 1 2 3]
 [4 5 6 7]]
Lower half:
 [[ 8  9 10 11]
 [12 13 14 15]]
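
`np.array_split` and `np.split` differ when the array does not divide evenly. The sketch below shows both behaviours, plus splitting at explicit positions:

```python
import numpy as np

arr = np.arange(10)

# array_split tolerates uneven division: the first chunks receive the extra items
parts = np.array_split(arr, 3)
print([p.tolist() for p in parts])   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]

# np.split insists on an even division and raises ValueError otherwise
split_failed = False
try:
    np.split(arr, 3)
except ValueError as err:
    split_failed = True
    print("np.split failed:", err)

# Passing a list of indices splits at those positions instead
head, middle, tail = np.split(arr, [2, 7])
print(head, middle, tail)            # [0 1] [2 3 4 5 6] [7 8 9]
```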


## 10. Copy vs. View <a name="copy-view"></a>

It is important to understand when operations create a copy of data vs. a view (reference) to the same data.

In [39]:
original = np.array([1,2,3])
view = original.view()
copy = original.copy()

print("Original:", original)
# Modify the view
view[0] = 10
print("Original after modifying view:", original)  
# A change in the view changes the original

# Modify the copy
copy[1] = 20
print("Original after modifying copy:", original)
# The copy does not affect the original

Original: [1 2 3]
Original after modifying view: [10  2  3]
Original after modifying copy: [10  2  3]
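
Views also arise implicitly. As a rule of thumb, basic slicing returns a view, while fancy (integer-array) indexing returns a copy. A sketch that checks this with the `.base` attribute and `np.shares_memory`:

```python
import numpy as np

original = np.arange(6)

sliced = original[1:4]        # basic slicing -> view
fancy = original[[1, 2, 3]]   # fancy indexing -> copy

# .base points at the array that owns the data a view refers to
print(sliced.base is original)              # True
print(np.shares_memory(original, fancy))    # False

sliced[0] = 99                # visible through original
fancy[1] = -5                 # not visible through original
print(original)               # [ 0 99  2  3  4  5]
```

Call `.copy()` on a slice when you need to modify it without touching the source array.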


## 11. Broadcasting <a name="broadcasting"></a>

![image.png](attachment:image.png)

Broadcasting allows arithmetic operations between arrays of different shapes by "stretching" them appropriately.

![image-2.png](attachment:image-2.png)
https://miro.medium.com/v2/resize:fit:1400/1*lY8Ve6Uz_bqVI5NPh5RPZA.png

In [41]:
import numpy as np

# Example arrays
a = np.array([1, 2, 3])          # Shape (3,), treated as (1, 3) during broadcasting
b = np.array([[10], [20], [30]]) # Shape (3, 1)

# (3,) * (3, 1) broadcasts to (3, 3): every element of b multiplies every element of a
result_mult = a * b
print("Multiplying columns result:\n", result_mult)



Multiplying columns result:
 [[10 20 30]
 [20 40 60]
 [30 60 90]]
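
The rule behind broadcasting is that shapes are compared from the right, and two dimensions are compatible when they are equal or one of them is 1. The sketch below (made-up data) centres the columns and then the rows of a small matrix, and shows what happens when the shapes are incompatible:

```python
import numpy as np

data = np.array([[1.0, 10.0, 100.0],
                 [3.0, 30.0, 300.0]])      # shape (2, 3)

# (2, 3) with (3,): trailing dimensions match, so this broadcasts
col_means = data.mean(axis=0)              # shape (3,)
centered = data - col_means                # each column now has mean 0

# (2, 3) with (2,): trailing dimensions 3 and 2 differ, so this fails
row_means = data.mean(axis=1)              # shape (2,)
broadcast_failed = False
try:
    data - row_means
except ValueError as err:
    broadcast_failed = True
    print("Broadcast error:", err)

# Adding an axis gives shape (2, 1), which is compatible with (2, 3)
row_centered = data - row_means[:, np.newaxis]

# np.broadcast_shapes reports the resulting shape without doing any arithmetic
print(np.broadcast_shapes((2, 3), (2, 1)))  # (2, 3)
```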


## 12. Linear Algebra with NumPy <a name="linear-algebra"></a>

NumPy’s `linalg` module provides linear algebra routines.

In [42]:
# Define matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[2, 0], [1, 2]])

# Matrix multiplication
C = A @ B  # or np.dot(A, B)
print("Matrix product:\n", C)

# Determinant of A
detA = np.linalg.det(A)
print("Determinant of A:", detA)

# Inverse of A
invA = np.linalg.inv(A)
print("Inverse of A:\n", invA)


Matrix product:
 [[ 4  4]
 [10  8]]
Determinant of A: -2.0000000000000004
Inverse of A:
 [[-2.   1. ]
 [ 1.5 -0.5]]
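
A frequent reason to compute an inverse is to solve a linear system `A @ x = b`. For that task `np.linalg.solve` is generally preferable, since it avoids forming the inverse explicitly. A sketch using the same matrix `A` and an arbitrary right-hand side:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 2.0])   # arbitrary right-hand side for illustration

# Solve A @ x = b directly
x = np.linalg.solve(A, b)
print("x:", x)

# Check the solution by substituting it back
print(np.allclose(A @ x, b))             # True

# The same answer via the explicit inverse, for comparison
x_via_inv = np.linalg.inv(A) @ b
print(np.allclose(x, x_via_inv))         # True
```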


## 13. Advanced Linear Algebra Functions <a name="advanced-linalg"></a>

Beyond basics, explore more linear algebra tools:

In [43]:
# QR Decomposition
Q, R = np.linalg.qr(A)
print("Q matrix:\n", Q)
print("R matrix:\n", R)

# Singular Value Decomposition (SVD)
U, s, Vh = np.linalg.svd(A)
print("U matrix:\n", U)
print("Singular values:", s)
print("Vh matrix:\n", Vh)

# Cholesky Decomposition (for symmetric positive definite matrices)
M = np.array([[4, 2],[2, 3]])
L = np.linalg.cholesky(M)
print("Cholesky factor L:\n", L)

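# Least squares solution. A is square and invertible here, so this is the exact
# solution of A @ x = b; lstsq is most useful for overdetermined systems.
b = np.array([1, 2])
x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)  # sv: singular values of A
print("Least squares solution x:", x)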

Q matrix:
 [[-0.31622777 -0.9486833 ]
 [-0.9486833   0.31622777]]
R matrix:
 [[-3.16227766 -4.42718872]
 [ 0.         -0.63245553]]
U matrix:
 [[-0.40455358 -0.9145143 ]
 [-0.9145143   0.40455358]]
Singular values: [5.4649857  0.36596619]
Vh matrix:
 [[-0.57604844 -0.81741556]
 [ 0.81741556 -0.57604844]]
Cholesky factor L:
 [[2.         0.        ]
 [1.         1.41421356]]
Least squares solution x: [1.50464292e-16 5.00000000e-01]
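
For a genuinely overdetermined case (more equations than unknowns), a straight-line fit is a typical example. The sketch below uses four made-up points lying roughly on `y = 2x + 1`; the data and variable names are assumptions for illustration only:

```python
import numpy as np

# Four made-up points that lie roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix: one column for the slope, one column of ones for the intercept
design = np.column_stack([x, np.ones_like(x)])   # shape (4, 2): 4 equations, 2 unknowns

coeffs, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
slope, intercept = coeffs
print(f"slope={slope:.3f}, intercept={intercept:.3f}, rank={rank}")

# np.polyfit with degree 1 solves the same least-squares problem
same_as_polyfit = np.allclose(coeffs, np.polyfit(x, y, 1))
print(same_as_polyfit)
```

Here no exact solution exists, so `lstsq` returns the coefficients that minimize the sum of squared differences between `design @ coeffs` and `y`.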


## 14. Random Module and Random Numbers <a name="random"></a>

Creating reproducible random sequences and drawing from distributions:

In [50]:

# Uniform distribution on [0,1)
random_array = np.random.rand(2,3)
print("Random uniform array:\n", random_array)

# Random integers from 0 to 99
random_ints = np.random.randint(0, 100, size=(2,4))
print("Random integers:\n", random_ints)

# Normal distribution (mean=0, std=1)
normal_samples = np.random.randn(2,3)
print("Normal distribution samples:\n", normal_samples)

Random uniform array:
 [[0.64589411 0.43758721 0.891773  ]
 [0.96366276 0.38344152 0.79172504]]
Random integers:
 [[87 46 88 81]
 [37 25 77 72]]
Normal distribution samples:
 [[ 0.48431215  0.57914048 -0.18158257]
 [ 1.41020463 -0.37447169  0.27519832]]


In [51]:
np.random.seed(0)  # Set seed for reproducibility


# Example 1: Same seed produces same random numbers
print("Example 1: Reproducibility with same seed")
print("-" * 40)

np.random.seed(0)
print("First run with seed(0):")
print(np.random.rand(5))  # Generate 5 random numbers

np.random.seed(0)  # Reset to same seed
print("\nSecond run with seed(0):")
print(np.random.rand(5))  # Will generate the same 5 numbers


Example 1: Reproducibility with same seed
----------------------------------------
First run with seed(0):
[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548 ]

Second run with seed(0):
[0.5488135  0.71518937 0.60276338 0.54488318 0.4236548 ]
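
NumPy's documentation recommends the newer `Generator` interface, created with `np.random.default_rng`, for new code. It keeps the random state in an object instead of a global seed. A minimal sketch of the equivalents of `rand`, `randint`, and `randn`:

```python
import numpy as np

# default_rng returns an independent Generator; the seed makes it reproducible
rng = np.random.default_rng(seed=0)
first = rng.random(5)                     # uniform floats in [0, 1)

# A second generator with the same seed reproduces the same stream
rng_again = np.random.default_rng(seed=0)
second = rng_again.random(5)
print(np.array_equal(first, second))      # True

# Generator equivalents of randint and randn
ints = rng.integers(0, 10, size=(2, 3))   # integers in [0, 10)
normals = rng.normal(loc=0.0, scale=1.0, size=(2, 3))
print(ints)
print(normals)
```

Note that a `Generator` seeded with 0 does not produce the same numbers as `np.random.seed(0)`, because the two interfaces use different underlying algorithms.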


### Structured Arrays
Structured arrays hold heterogeneous data, similar to records or rows in a table.

In [46]:
dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f4')])
data = np.array([('Alice', 25, 95.5), ('Bob', 30, 88.0)], dtype=dt)
print("Names:", data['name'])
print("Ages:", data['age'])
print("Scores:", data['score'])

Names: ['Alice' 'Bob']
Ages: [25 30]
Scores: [95.5 88. ]
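
Fields of a structured array can drive filtering and sorting of whole records. The sketch below extends the example with one extra, made-up record (`'Cara'`) so that the ordering is visible:

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f4')])
# 'Cara' is a hypothetical extra record added for illustration
data = np.array([('Alice', 25, 95.5), ('Bob', 30, 88.0), ('Cara', 22, 91.0)], dtype=dt)

# A boolean mask built from one field filters whole records
over_23 = data[data['age'] > 23]
print(over_23['name'])                 # ['Alice' 'Bob']

# np.sort with order= sorts records by the named field (ascending)
by_score = np.sort(data, order='score')
print(by_score['name'])                # ['Bob' 'Cara' 'Alice']

# Each field behaves like an ordinary array in vectorized computations
print(data['score'].mean())
```

For larger tabular datasets, pandas `DataFrame` objects are usually more convenient than structured arrays.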


## <a name="vectorization"></a>Vectorization

- Vectorization means operating on entire arrays rather than on individual elements.
- NumPy runs the element loop in compiled C code, which avoids the per-element overhead of the Python interpreter.
- Processors also provide vector (SIMD) instructions, which NumPy can use to process several elements per instruction.
- Vectorization mantra: _Forget loops, think of the entire array._


In [52]:
import numpy as np
import time

# Create two arrays
size = 1000000
arr1 = np.random.rand(size)
arr2 = np.random.rand(size)

# Traditional loop approach
def multiply_with_loop(a, b):
    result = np.zeros_like(a)
    for i in range(len(a)):
        result[i] = a[i] * b[i]
    return result

# Time the loop approach
start_time = time.perf_counter()  # perf_counter is a high-resolution clock meant for timing
result_loop = multiply_with_loop(arr1, arr2)
loop_time = time.perf_counter() - start_time
print(f"Loop multiplication time: {loop_time:.4f} seconds")

# Vectorized approach
start_time = time.perf_counter()
result_vector = arr1 * arr2  # NumPy vectorized multiplication
vector_time = time.perf_counter() - start_time
print(f"Vectorized multiplication time: {vector_time:.4f} seconds")
print(f"Speedup: {loop_time/vector_time:.1f}x")

# Verify both approaches give the same result
print("\nBoth approaches give the same result:", 
      np.allclose(result_loop, result_vector))

# Show first few results
print("\nFirst 5 elements of arrays:")
print("Array 1:", arr1[:5])
print("Array 2:", arr2[:5])
print("Result:", result_vector[:5])

Loop multiplication time: 0.4576 seconds
Vectorized multiplication time: 0.0030 seconds
Speedup: 152.2x

Both approaches give the same result: True

First 5 elements of arrays:
Array 1: [0.64589411 0.43758721 0.891773   0.96366276 0.38344152]
Array 2: [0.27965023 0.95573175 0.03659582 0.60934925 0.59753225]
Result: [0.18062444 0.41821599 0.03263517 0.58720718 0.22911867]
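
Loops that contain an `if` can usually be vectorized as well: a boolean mask takes the place of the condition, and a reduction takes the place of a counter. A sketch under made-up data, comparing both versions for equality rather than speed:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=10_000)

# Loop version: set negatives to zero and count how many were changed
clipped_loop = np.empty_like(values)
n_clipped_loop = 0
for i, v in enumerate(values):
    if v < 0:
        clipped_loop[i] = 0.0
        n_clipped_loop += 1
    else:
        clipped_loop[i] = v

# Vectorized version: the mask replaces the if, the sum replaces the counter
negative = values < 0
clipped_vec = np.where(negative, 0.0, values)
n_clipped_vec = int(negative.sum())

print(np.array_equal(clipped_loop, clipped_vec), n_clipped_loop == n_clipped_vec)
```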


## <a name="exercise"></a>Exercise: Working with Text Data

- Use the function below to read the file 'languages.txt' into a Python list.
- Convert the list to a NumPy array and clean it up using appropriate [NumPy String functions](https://numpy.org/doc/stable/reference/routines.char.html):
  - get rid of empty entries
  - convert everything to lower case
  - get rid of leading and trailing whitespace characters
- Use appropriate [NumPy functions](https://numpy.org/doc/stable/reference/routines.array-manipulation.html) to perform a tally of all the unique responses.



In [20]:
def freadToList(fname, sep='\n'):
    # The with-block closes the file even if reading fails
    with open(fname, 'rt', encoding='utf8') as file:
        text = file.read()
    # split based on provided separator
    return text.split(sep=sep)

In [None]:
import numpy as np

# Read data into list
languages_list = freadToList('languages.txt')

# Convert to numpy array
languages = np.array(languages_list)

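# Clean the data: lower-case, strip whitespace, then drop empty entries
clean_languages = np.char.strip(np.char.lower(languages))
clean_languages = clean_languages[clean_languages != '']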

# Get unique values and counts
unique, counts = np.unique(clean_languages, return_counts=True)

# Print results
print("Language frequencies:")
print("-" * 20)
for lang, count in zip(unique, counts):
    print(f"{lang}: {count}")

print(f"\nTotal unique languages: {len(unique)}")
print(f"Total responses: {len(clean_languages)}")
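
To try the cleaning steps without the data file, the sketch below runs them on a small in-memory array. The responses are hypothetical and only stand in for the contents of `languages.txt`:

```python
import numpy as np

# Hypothetical responses standing in for the contents of languages.txt
responses = np.array(["Python", " python ", "", "R", "r\t", "SQL", "  ", "PYTHON"])

# Strip leading/trailing whitespace and lower-case every entry
cleaned = np.char.lower(np.char.strip(responses))

# Drop entries that are empty after stripping
cleaned = cleaned[cleaned != ""]

# Tally the unique responses
unique, counts = np.unique(cleaned, return_counts=True)
for lang, count in zip(unique, counts):
    print(f"{lang}: {count}")
```

Stripping before filtering matters: an entry that contains only spaces becomes empty after `np.char.strip` and is then removed by the mask.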