# Python workshop - 2025

<div>
    <img src="../images/qcbs_logo_v2.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

<div>
    <img src="../images/python_logo_generic.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

**Last update**: 2025-05-19  
**Author**: El-Amine Mimouni  
**Affiliation**: Québec Centre for Biodiversity Science

**Overview**: In this notebook, we will see how to work with NumPy arrays.

---

# NumPy

NumPy (Numerical Python) is the core library for numerical and scientific computing in Python. It is the main computational workhorse behind many of the computations done in Python. It provides powerful support for multi-dimensional arrays and a wide range of mathematical operations.

If you want to learn more about it, visit [https://numpy.org/](https://numpy.org/).

In [None]:
# Import numpy
import numpy as np

# Creating Arrays

A NumPy array is similar to a mathematical vector or a matrix. It provides a powerful way to work with numerical data in Python. NumPy arrays are highly efficient for performing mathematical operations and offer more flexibility compared to regular Python lists. Arrays in NumPy can be constructed in many ways.
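
Besides `np.array()`, a few other constructors come up often. The following is a minimal sketch; the variable names are illustrative and not part of the workshop material.

```python
import numpy as np

# Evenly spaced integers from 0 up to (but not including) 10, in steps of 2
arr_range = np.arange(0, 10, 2)

# Five evenly spaced values between 0 and 1, both endpoints included
arr_lin = np.linspace(0.0, 1.0, num=5)

# A 2x3 array filled with a constant value
arr_full = np.full((2, 3), 7.5)

print("arr_range:", arr_range)
print("arr_lin:", arr_lin)
print("arr_full:\n", arr_full)
```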

In [None]:
# Create a 1-dimensional array (i.e. a vector)
arr_1d = np.array(object=[1, 2, 3, 4, 5])

print("Value of arr_1d:", arr_1d)
print("Type of arr_1d:", type(arr_1d))

# Important attributes
print("\nShape of arr_1d:", arr_1d.shape)
print("Data type of arr_1d:", arr_1d.dtype)

In [None]:
# Create a 2-dimensional array (i.e. a matrix)
# Note that each row is given as a list inside an outer list: [[row1], [row2]].
arr_2d = np.array(object=[[1.2, 2.5, 3.1], [4.8, 5.1, 6.5]])

print("Value of arr_2d:\n", arr_2d)
print("Type of arr_2d:", type(arr_2d))

# Important attributes
print("\nShape of arr_2d:", arr_2d.shape)
print("Data type of arr_2d:", arr_2d.dtype)

In [None]:
# Unlike lists, NumPy arrays apply the basic arithmetic operators element by element.
print("The result of multiplying every value of arr_1d by 2:")
print(arr_1d * 2)
#
print("\nThe result of adding 5.8 to every value of arr_2d:")
print(arr_2d + 5.8)

In [None]:
# If you have doubts, create the list-equivalent of arr_1d
list_1d = [1, 2, 3, 4, 5]
#
print("Values in list_1d:", list_1d)
print("Values in arr_1d:", arr_1d)

# Look at how they react with the + operator
print("\nlist_1d.__add__:", list_1d.__add__)
print("arr_1d.__add__:", arr_1d.__add__)
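
To make the difference concrete, here is a small sketch comparing the two types side by side. The names `list_a` and `arr_a` are illustrative only.

```python
import numpy as np

list_a = [1, 2, 3]
arr_a = np.array(list_a)

# For lists, + concatenates and * repeats
list_sum = list_a + list_a
list_rep = list_a * 2

# For arrays, + and * operate element by element
arr_sum = arr_a + arr_a
arr_rep = arr_a * 2

print("list_a + list_a:", list_sum)
print("list_a * 2:", list_rep)
print("arr_a + arr_a:", arr_sum)
print("arr_a * 2:", arr_rep)
```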

# Slicing and accessing elements

Slicing and indexing work much as they do with conventional Python lists, with one index or slice per dimension, separated by commas.

In [None]:
# The most general form
# Select everything
print(arr_2d[:, :])

In [None]:
# Input the indices of the rows you want to select
print("The first row of arr_2d:")
print(arr_2d[0, :])

# Input the indices of the columns you want to select
print("\nThe third column of arr_2d:")
print(arr_2d[:, 2])


In [None]:
# The : and even the , can be omitted when selecting rows
# But I recommend leaving them for clarity,
# since they clearly show the number of dimensions of your array
print("The result of arr_2d[0, :]:")
print(arr_2d[0, :])

print("\nThe result of arr_2d[0,]:")
print(arr_2d[0,])

print("\nThe result of arr_2d[0]:")
print(arr_2d[0])

In [None]:
# If you want to select a range of rows or columns, use the colon :
# The start index is included and the stop index is excluded.
print("Rows 1 to 2 of arr_2d:")
print(arr_2d[0:2, :])
#
print("\nColumns 2 to 3 of arr_2d:")
print(arr_2d[:, 1:3])

In [None]:
# If you want particular rows or columns, you can input their indices as lists:
print("Rows 1 to 2, and columns 1 and 3 of arr_2d:")
print(arr_2d[0:2, [0, 2]])
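
One detail worth checking when subsetting is the shape of the result. As a small sketch (the array is re-created here so the example is self-contained, and the variable names are illustrative), an integer index drops a dimension while a slice keeps it:

```python
import numpy as np

arr = np.array([[1.2, 2.5, 3.1],
                [4.8, 5.1, 6.5]])

# An integer index removes that dimension from the result
row_1d = arr[0, :]

# A slice of length one keeps the result 2-dimensional
row_2d = arr[0:1, :]

# Negative indices count from the end, as with lists
last_col = arr[:, -1]

print("Shape of arr[0, :]:", row_1d.shape)
print("Shape of arr[0:1, :]:", row_2d.shape)
print("Last column:", last_col)
```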


# Important methods

In [None]:
# Each matrix has the usual mathematical methods
# These are .mean(), .min(), .max()

# Note: axis=None is the default, so the argument can be omitted.
print("Grand mean of arr_2d:")
print(arr_2d.mean(axis=None))
#
print("\nColumn means of arr_2d:")
print(arr_2d.mean(axis=0))
#
print("\nRow means of arr_2d:")
print(arr_2d.mean(axis=1))

In [None]:
# Special attention is needed regarding the variance/stdev
# It can be computed with the .var() method as shown below:
print("Variance of variables in arr_2d:")
print(arr_2d.var(axis=0))

# It is different from what you would obtain by hand.
# The reason for this difference is that NumPy uses
# the maximum likelihood estimate (MLE) of the variance by default.
# Therefore, for a sample of N observations, the sum of squares is
# divided by N rather than (N - 1). Pass ddof=1 to divide by (N - 1).
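
The two denominators can be compared directly. This is a minimal sketch that re-creates the array and uses illustrative variable names; `ddof` is the "delta degrees of freedom" argument of `.var()`.

```python
import numpy as np

arr = np.array([[1.2, 2.5, 3.1],
                [4.8, 5.1, 6.5]])
n = arr.shape[0]

# Default (ddof=0): sum of squares divided by N
var_n = arr.var(axis=0)

# ddof=1: sum of squares divided by N - 1
var_n1 = arr.var(axis=0, ddof=1)

# The same N - 1 estimate computed by hand
centered = arr - arr.mean(axis=0)
var_hand = (centered ** 2).sum(axis=0) / (n - 1)

print("Divided by N:", var_n)
print("Divided by N - 1:", var_n1)
print("By hand:", var_hand)
```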

In [None]:
# This can be seen in the np.cov() function:

# The default is rowvar=True, which treats each row as a variable
# and each column as an observation. If your variables are in the
# columns, always use rowvar=False.

# With bias=True, the estimate is divided by N
print("\nResult of np.cov() with rowvar=False and bias=True:")
print(np.cov(arr_2d, rowvar=False, bias=True))

# Thankfully for most analyses, bias=False (division by N - 1) is the default
print("\nResult of np.cov() with rowvar=False and bias=False:")
print(np.cov(arr_2d, rowvar=False, bias=False))

# Reading data

In [None]:
# By the way, np.genfromtxt() works on other delimited formats besides .txt
mite_env = np.genfromtxt(fname="../data/mite_env.csv", skip_header=1, delimiter=",")

# Print the first five lines
print("The first five lines of mite_env:")
print(mite_env[0:5,])

In [None]:
print("Dimensions of mite_env:", mite_env.shape)
print("Data type of mite_env:", mite_env.dtype)

In [None]:
# NumPy is not well-suited for qualitative variables
# (with the default float dtype, text values are read as nan)
# Extract only the quantitative variables
mite_env_quant = mite_env[:, 0:2]

# Print the first five lines
print("The first five lines of mite_env_quant:")
print(mite_env_quant[0:5,])
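
To see what happens to text columns without needing the data file, here is a small sketch that reads a made-up CSV from memory. The column names and values are hypothetical and are not taken from `mite_env.csv`.

```python
import io
import numpy as np

# Hypothetical CSV: two numeric columns and one text column
csv_text = "dens,water,shrub\n39.2,350.1,Few\n54.9,434.8,Many\n"

# np.genfromtxt() also accepts file-like objects
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print("Full array (text column becomes nan):")
print(data)

# Keep only the numeric columns
data_quant = data[:, 0:2]
print("\nNumeric columns only:")
print(data_quant)
```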

# Mini-matrix primer

In [None]:
# Create a small matrix
mat1 = np.array([[2, 4],
                 [1, 6],
                 [5, 3]])

print("Values in mat1:\n", mat1)
print("\nShape of mat1:", mat1.shape)

In [None]:
# The transpose of a matrix is the same matrix
# with its rows and columns swapped
# It is available as the attribute .T

print("Values in mat1.T:\n", mat1.T)
print("\nShape of mat1.T:", mat1.T.shape)

In [None]:
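# A vector can be seen as a matrix with a single row or column
# It can be a 1xp row vector or a px1 column vector
# When entering values into NumPy, mind the [[]] notation

# Without the nested brackets, NumPy creates a 1-dimensional
# array of shape (p,), which is neither a row nor a column vector.
# Here, the nested lists give a 2x1 column vector.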

vec1 = np.array(object=[[3], [2]])

print("Values in vec1:\n", vec1)
print("\nShape of vec1:", vec1.shape)
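
The three ways of entering the same two values give three different shapes. A short sketch with illustrative names:

```python
import numpy as np

flat = np.array([3, 2])        # 1-dimensional array
row = np.array([[3, 2]])       # 1x2 row vector
col = np.array([[3], [2]])     # 2x1 column vector

print("Shape of flat:", flat.shape)
print("Shape of row:", row.shape)
print("Shape of col:", col.shape)

# reshape() converts between them; -1 lets NumPy infer that dimension
col_from_flat = flat.reshape(-1, 1)
print("Shape of flat.reshape(-1, 1):", col_from_flat.shape)
```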

In [None]:
# If the number of columns in the first matrix
# matches the number of rows in the second
# matrix, the product can be computed.

print("Shape of mat1:", mat1.shape)
print("\nShape of vec1:", vec1.shape)

In [None]:
print("Values in mat1:\n", mat1)
print("Values in vec1:\n", vec1)

In [None]:
# The matrix product in Python can be done
# as a method or with an operator

print("Matrix product as a method:")
print(mat1.dot(b=vec1))
#
print("\nMatrix product with an operator:")
print(mat1 @ vec1)
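
The shape rule can be checked directly. In this sketch (matrices re-created under the illustrative names `A` and `v`), the product works in one order and fails in the other because the inner dimensions do not match:

```python
import numpy as np

A = np.array([[2, 4],
              [1, 6],
              [5, 3]])          # shape (3, 2)
v = np.array([[3], [2]])        # shape (2, 1)

# (3, 2) @ (2, 1) gives a (3, 1) result
product = A @ v
print("A @ v:\n", product)
print("Shape:", product.shape)

# (2, 1) @ (3, 2) is not defined: 1 column versus 3 rows
try:
    v @ A
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("\nv @ A failed:", err)
```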

In [None]:
# Variance-covariance matrices can be obtained
# very efficiently using matrix algebra.

# Get the number of observations
n = mat1.shape[0]
# Center the data
mat1_c = (mat1 - mat1.mean(axis=0))
# Compute the variance-covariance matrix
S = 1.0 / (n - 1.0) * mat1_c.T @ mat1_c

print("Variance-covariance matrix:")
print(S)
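
As a sanity check, the matrix-algebra result should agree with `np.cov()` when the variables are in the columns. A minimal sketch, with the small matrix re-created under the illustrative name `X`:

```python
import numpy as np

X = np.array([[2, 4],
              [1, 6],
              [5, 3]])
n = X.shape[0]

# Variance-covariance matrix from centered data
Xc = X - X.mean(axis=0)
S_manual = (Xc.T @ Xc) / (n - 1)

# Same quantity from NumPy (bias=False is the default)
S_numpy = np.cov(X, rowvar=False)

print("Do both approaches agree?")
print(np.allclose(S_manual, S_numpy))
```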

In [None]:
# Special arrays can be built for linear algebra

# Zeros
zeros = np.zeros((2, 2))
print("A 2x2 matrix of 0's:")
print(zeros)

# Ones
ones = np.ones((4, 1))
print("\nA 4x1 matrix of 1's:")
print(ones)

# Identity matrix
print("\nA 3x3 identity matrix:")
print(np.eye(N=3))

In [None]:
# By supplying matrices into np.hstack() (horizontal stack),
# you can concatenate matrices together.
# Can be useful for linear regression for example.

# Get the number of observations and generate a column
# matrix of 1's 
n = mite_env_quant.shape[0]
ones_n = np.ones((n, 1))

# Concatenate a column matrix of 1's
mite_env_con1 = np.hstack(tup=[ones_n, mite_env_quant])

# Show the first five values
print("The first five values of mite_env_con1:")
print(mite_env_con1[0:5, ])

# There is also np.vstack() (vertical stack) for concatenating observations by
# rows.

# If you know R, np.vstack() and np.hstack() are similar to rbind() and cbind(), respectively.
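
The column of 1's plays the role of the intercept in a linear regression. The following is a sketch on made-up data (the values of `x` and `y` are hypothetical, chosen so that y = 1 + 2x exactly), using `np.linalg.lstsq()` as one possible way to get the coefficients.

```python
import numpy as np

# Hypothetical data following y = 1 + 2x
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Design matrix: a column of 1's (intercept) next to the predictor
X = np.hstack([np.ones((x.shape[0], 1)), x])

# Least-squares solution; the first coefficient is the intercept
coef, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

print("Design matrix:\n", X)
print("\nIntercept and slope:", coef)
```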

# Linear algebra with np.linalg


In [None]:
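# Some of the functions available in np.linalg:
#np.linalg.cholesky()  # Cholesky decomposition
#np.linalg.eig()       # Eigenvalues and eigenvectors
#np.linalg.qr()        # QR decomposition
#np.linalg.svd()       # Singular value decomposition
#np.linalg.inv()       # Matrix inverse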

In [None]:
# Create a square matrix that could be a covariance matrix
# between two variables
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

In [None]:
# Compute its determinant
print("The determinant of S:")
print(np.linalg.det(a=S))
print(type(np.linalg.det(a=S)))

In [None]:
# Get the inverse of the S matrix
Sm1 = np.linalg.inv(a=S)

print("The inverse of S:")
print(Sm1)
print(type(Sm1))

In [None]:
print("\nThe result of Sm1 x S:")
print(Sm1 @ S)
print(type(Sm1 @ S))

print("\nThe result of S x Sm1:")
print(S @ Sm1)
print(type(S @ Sm1))
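
Because of floating-point rounding, the product of a matrix and its inverse may not print as an exact identity matrix. A small sketch of how to check it anyway, using `np.allclose()` (the name `S_inv` is illustrative):

```python
import numpy as np

S = np.array([[1.0, 0.8],
              [0.8, 1.0]])
S_inv = np.linalg.inv(S)

# Compare with the identity matrix up to a small numerical tolerance
is_identity = np.allclose(S_inv @ S, np.eye(2))

print("Is S_inv @ S the identity (within tolerance)?")
print(is_identity)
```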

In [None]:
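# Invert matrix S2, which is singular
# (S2 is not defined earlier; this example matrix is an assumption)
S2 = np.array([[1.0, 1.0], [1.0, 1.0]])
# Uncomment at your own risk
# (There ain't no risk, it's mathematically impossible:
# NumPy raises a LinAlgError)
#np.linalg.inv(S2)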
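
If you would rather see the failure without stopping the notebook, the error can be caught. This sketch uses a made-up singular matrix (`S_sing` is a hypothetical name and its values are an assumption, not from the workshop data):

```python
import numpy as np

# Hypothetical singular matrix: both rows are identical
S_sing = np.array([[1.0, 1.0],
                   [1.0, 1.0]])

print("Determinant:", np.linalg.det(S_sing))

try:
    np.linalg.inv(S_sing)
    inverted = True
except np.linalg.LinAlgError as err:
    inverted = False
    print("Could not invert:", err)
```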

# The return of list unpacking

In [None]:
# Perform eigenanalysis of S
print(np.linalg.eig(a=S))

In [None]:
# So you'd better catch them as:
out_values, out_vectors = np.linalg.eig(a=S)

# Look at them!
print("\nThe eigenvalues:")
print(out_values)

print("\nThe eigenvectors:")
print(out_vectors)

In [None]:
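# You can now write a small function for PCA
def pca(X):
    # Number of observations
    n = X.shape[0]
    # Center the variables
    Xc = X - X.mean(axis=0)
    # Variance-covariance matrix
    S = 1.0 / (n - 1.0) * Xc.T @ Xc
    # Eigenvalues and eigenvectors (not necessarily sorted)
    lam, U = np.linalg.eig(a=S)
    # Principal component scores
    F = Xc @ U
    return [lam, U, F]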

In [None]:
# Apply it on the mite_env_quant dataset
pca(X=mite_env_quant)

# Note: In practice, you would standardize the
# variables first, but this is for illustration purposes.
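
`np.linalg.eig()` does not guarantee any particular ordering of the eigenvalues. As one possible variant (a sketch, not the workshop's function), `np.linalg.eigh()` is designed for symmetric matrices such as covariance matrices, and the components can then be sorted by decreasing eigenvalue. The function name `pca_sorted` and the demo data are hypothetical.

```python
import numpy as np

def pca_sorted(X):
    """PCA on the covariance matrix, components sorted by decreasing variance."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / (n - 1)
    # eigh() assumes a symmetric matrix and returns ascending eigenvalues
    lam, U = np.linalg.eigh(S)
    # Reorder from largest to smallest eigenvalue
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # Principal component scores
    F = Xc @ U
    return lam, U, F

# Hypothetical demo data: 4 observations, 2 variables
X_demo = np.array([[2.0, 4.0],
                   [1.0, 6.0],
                   [5.0, 3.0],
                   [4.0, 1.0]])

lam, U, F = pca_sorted(X_demo)
print("Eigenvalues (sorted):", lam)
print("Variance of the scores:", F.var(axis=0, ddof=1))
```

The variance of each column of scores should match the corresponding eigenvalue, which is a quick way to check the result.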

# Masked matrices

In [None]:
# Generate a 5x5 matrix with values drawn at random from 0, 1, 2, 3 and -999
ex_array = np.random.choice(a=[0.0, 1.0, 2.0, 3.0, -999.0], size=(5, 5))

# See the values in my array
print("Array with some values as -999")
print(ex_array)

# Create a copy of the array and
# Replace values that are equal to -999 with np.nan
ex_nan = ex_array.copy()
ex_nan[ex_nan == -999] = np.nan

# See the values in the new array
print("\nArray with -999 coded as np.nan:")
print(ex_nan)

In [None]:
# Determine a boolean mask defined by whether or not
# values are equal to -999
mymask = ex_array == -999

# See the values in the mask
print("Boolean mask:")
print(mymask)

# Create a masked array from this mask
print("\nMasked array:")
ex_mask = np.ma.masked_array(data=ex_array, mask=mymask)
print(ex_mask)

In [None]:
# Print out the mean of these arrays
print("The mean of ex_array is:", ex_array.mean())
print("The mean of ex_nan is:", ex_nan.mean())
print("The nanmean of ex_nan is:", np.nanmean(ex_nan))
print("The mean of ex_mask is:", ex_mask.mean())
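
Since the array above is random, the means change on every run. Here is a deterministic sketch of the same comparison on a tiny made-up vector (the values and names are illustrative), using `np.ma.masked_equal()` as a shortcut for building the mask:

```python
import numpy as np

raw = np.array([1.0, 2.0, -999.0, 3.0])

# Option 1: recode the missing-value code as nan
as_nan = np.where(raw == -999.0, np.nan, raw)

# Option 2: mask every value equal to -999
as_masked = np.ma.masked_equal(raw, -999.0)

mean_raw = raw.mean()            # distorted by the -999 code
mean_nan = np.nanmean(as_nan)    # ignores nan
mean_masked = as_masked.mean()   # ignores masked values
n_valid = as_masked.count()      # number of unmasked values

print("Mean of raw:", mean_raw)
print("nanmean of as_nan:", mean_nan)
print("Mean of as_masked:", mean_masked)
print("Number of valid values:", n_valid)
```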

# Views and copies

In [None]:
# Create three vectors: vec_2 is a slice of vec_1, vec_3 is a copy of that slice
vec_1 = np.array([1, 2, 3, 4, 5])
vec_2 = vec_1[2:]
vec_3 = vec_1[2:].copy()

# Look at them!
print(vec_1)
print(vec_2)
print(vec_3)

In [None]:
print("ID of vec_1:", id(vec_1))
print("ID of vec_2:", id(vec_2))
print("ID of vec_3:", id(vec_3))

In [None]:
print("Does vec_1 share memory with vec_2?")
print(np.shares_memory(vec_1, vec_2))

print("\nDoes vec_1 share memory with vec_3?")
print(np.shares_memory(vec_1, vec_3))

print("\nDoes vec_2 share memory with vec_3?")
print(np.shares_memory(vec_2, vec_3))

In [None]:
# Change a value in vec_2
vec_2[1] = 9999

# Look at them!
print("Values of vec_1:")
print(vec_1)

print("\nValues of vec_2:")
print(vec_2)

print("\nValues of vec_3:")
print(vec_3)

In [None]:
# So ask yourself when subsetting with NumPy:

# Will I do some analyses on this part and then go back to the original data?

# - If YES: Consider a .copy() of the data so you don't alter it unintentionally.
# - If NO: You can stick with a view, it is more memory-efficient (i.e. you weren't going to use it anyways).
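
Not every kind of subsetting gives a view. As a small sketch with illustrative names, basic slicing returns a view, while indexing with a list of positions returns a copy:

```python
import numpy as np

vec = np.array([1, 2, 3, 4, 5])

sliced = vec[2:]          # basic slicing: a view on vec
picked = vec[[2, 3, 4]]   # indexing with a list: a new copy

print("Does sliced share memory with vec?", np.shares_memory(vec, sliced))
print("Does picked share memory with vec?", np.shares_memory(vec, picked))

# Modifying the copy leaves the original untouched
picked[0] = 9999
print("vec after changing picked:", vec)

# Modifying the view changes the original
sliced[0] = -1
print("vec after changing sliced:", vec)
```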