# Python workshop - 2025

<div>
    <img src="../images/qcbs_logo_v2.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

<div>
    <img src="../images/python_logo_generic.svg" style="background-color: #f0f0f0; padding: 20px;"/>
</div>

**Last update**: 2025-11-21  
**Author**: El-Amine Mimouni  
**Affiliation**: Québec Centre for Biodiversity Science

**Overview**: In this notebook, we will see how to work with NumPy arrays.

---

# NumPy

NumPy (Numerical Python) is the core library for numerical and scientific computing in Python. It is the computational workhorse behind many of the computations done in Python: it provides multi-dimensional arrays and a wide range of mathematical operations on them.

If you want to learn more about it, visit [https://numpy.org/](https://numpy.org/).

In [1]:
# Import numpy
import numpy as np

# Creating Arrays

A NumPy array is similar to a mathematical vector or a matrix. It provides a powerful way to work with numerical data in Python. NumPy arrays are highly efficient for performing mathematical operations and offer more flexibility compared to regular Python lists. Arrays in NumPy can be constructed in many ways.
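
As a minimal sketch of the "many ways" mentioned above, the block below builds arrays with a few standard NumPy constructors. The variable names are chosen for illustration only and are not used elsewhere in this notebook.

```python
import numpy as np

# Evenly spaced integers from 0 up to (but not including) 5
arr_range = np.arange(5)

# Five evenly spaced values from 0 to 1, both ends included
arr_lin = np.linspace(start=0.0, stop=1.0, num=5)

# A 2x3 array filled with a constant value
arr_full = np.full(shape=(2, 3), fill_value=7.5)

# Convert a nested list while forcing a floating-point data type
arr_from_list = np.array([[1, 2], [3, 4]], dtype=float)

print("arr_range:", arr_range)
print("arr_lin:", arr_lin)
print("Shape of arr_full:", arr_full.shape)
print("Data type of arr_from_list:", arr_from_list.dtype)
```

Which constructor to use mostly depends on whether the values already exist (np.array) or follow a regular pattern (np.arange, np.linspace, np.full).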

In [2]:
# Create a 1-dimensional array (i.e. a vector)
arr_1d = np.array(object=[1, 2, 3, 4, 5])

print("Value of arr_1d:", arr_1d)
print("Type of arr_1d:", type(arr_1d))

# Important attributes
print("\nShape of arr_1d:", arr_1d.shape)
print("Data type of arr_1d:", arr_1d.dtype)

Value of arr_1d: [1 2 3 4 5]
Type of arr_1d: <class 'numpy.ndarray'>

Shape of arr_1d: (5,)
Data type of arr_1d: int64


In [3]:
# Create a 2-dimensional array (i.e. a matrix)
# Note that each row is given as a list inside an outer list: [[row1], [row2]].
arr_2d = np.array(object=[[1.2, 2.5, 3.1], [4.8, 5.1, 6.5]])

print("Value of arr_2d:\n", arr_2d)
print("Type of arr_2d:", type(arr_2d))

# Important attributes
print("\nShape of arr_2d:", arr_2d.shape)
print("Data type of arr_2d:", arr_2d.dtype)

Value of arr_2d:
 [[1.2 2.5 3.1]
 [4.8 5.1 6.5]]
Type of arr_2d: <class 'numpy.ndarray'>

Shape of arr_2d: (2, 3)
Data type of arr_2d: float64


In [4]:
# Unlike lists, NumPy arrays apply the basic arithmetic operators element by element.
print("The result of multiplying every value of arr_1d by 2:")
print(arr_1d * 2)
#
print("\nThe result of adding 5.8 to every value of arr_2d:")
print(arr_2d + 5.8)

The result of multiplying every value of arr_1d by 2:
[ 2  4  6  8 10]

The result of adding 5.8 to every value of arr_2d:
[[ 7.   8.3  8.9]
 [10.6 10.9 12.3]]


In [5]:
# If you have doubts, create the list-equivalent of arr_1d
list_1d = [1, 2, 3, 4, 5]
#
print("Values in list_1d:", list_1d)
print("Values in arr_1d:", arr_1d)

# Each type implements its own __add__ method, which is why + behaves differently for lists and arrays
print("\nlist_1d.__add__:", list_1d.__add__)
print("arr_1d.__add__:", arr_1d.__add__)

Values in list_1d: [1, 2, 3, 4, 5]
Values in arr_1d: [1 2 3 4 5]

list_1d.__add__: <method-wrapper '__add__' of list object at 0x00000270A0DEA580>
arr_1d.__add__: <method-wrapper '__add__' of numpy.ndarray object at 0x00000270A0EC1110>
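
To make the difference concrete, here is a small sketch comparing the same operators on a list and on an array. The names list_demo and arr_demo are illustrative and are not part of the cells above.

```python
import numpy as np

list_demo = [1, 2, 3]
arr_demo = np.array(list_demo)

# With lists, * repeats the list and + concatenates
list_times_2 = list_demo * 2
list_plus = list_demo + [10]

# With arrays, * and + are applied element by element
arr_times_2 = arr_demo * 2
arr_plus = arr_demo + 10

print("list_demo * 2:", list_times_2)
print("list_demo + [10]:", list_plus)
print("arr_demo * 2:", arr_times_2)
print("arr_demo + 10:", arr_plus)
```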


# Slicing and accessing elements

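Indexing works much like it does with conventional Python lists: indices start at 0 and the end of a slice is excluded. With a 2-dimensional array, you give one index or slice per dimension, separated by a comma, as in arr_2d[rows, columns].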

In [8]:
# The most general form
# Select everything
print(arr_2d[:, :])

[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]


In [None]:
# Input the indices of the rows you want to select
print("The first row of arr_2d:")
print(arr_2d[0, :])

# Input the indices of the columns you want to select
print("\nThe third column of arr_2d:")
print(arr_2d[:, 2])

The first row of arr_2d:
[1.2 2.5 3.1]

The third column of arr_2d:
[3.1 6.5]


In [10]:
# When selecting rows, the trailing : and even the , can be omitted
# But I recommend keeping them for clarity,
# since they clearly show the number of dimensions of your array
print("The result of arr_2d[0, :]:")
print(arr_2d[0, :])

print("\nThe result of arr_2d[0,]:")
print(arr_2d[0,])

print("\nThe result of arr_2d[0]:")
print(arr_2d[0])

The result of arr_2d[0, :]:
[1.2 2.5 3.1]

The result of arr_2d[0,]:
[1.2 2.5 3.1]

The result of arr_2d[0]:
[1.2 2.5 3.1]


In [11]:
# If you want to select a range of rows or columns, use the colon :
# The start index is included, the end index is excluded
print("Rows 1 to 2 of arr_2d:")
print(arr_2d[0:2, :])
#
print("\nColumns 2 to 3 of arr_2d:")
print(arr_2d[:, 1:3])

Rows 1 to 2 of arr_2d:
[[1.2 2.5 3.1]
 [4.8 5.1 6.5]]

Columns 2 to 3 of arr_2d:
[[2.5 3.1]
 [5.1 6.5]]


In [None]:
# If you want particular values, you can input them as lists:
print("Rows 1 to 2, and columns 1 and 3 of arr_2d:")
print(arr_2d[0:2, [0, 2]])

Rows 1 to 2, and columns 1 and 3 of arr_2d:
[[1.2 3.1]
 [4.8 6.5]]
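
A few other selection patterns follow the same rules. The sketch below rebuilds the same values as arr_2d under the illustrative name demo, so that it can run on its own.

```python
import numpy as np

# Same values as arr_2d, rebuilt here so the example is self-contained
demo = np.array([[1.2, 2.5, 3.1],
                 [4.8, 5.1, 6.5]])

# Negative indices count from the end, as with lists
last_row = demo[-1, :]

# A third slice element gives the step: here, every other column
every_other_col = demo[:, ::2]

# Two integers select a single value (row 2, column 3)
single_value = demo[1, 2]

# A boolean condition selects the values for which it is True
above_three = demo[demo > 3.0]

print("Last row:", last_row)
print("Every other column:\n", every_other_col)
print("Single value:", single_value)
print("Values above 3:", above_three)
```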


# Important methods

In [15]:
# Arrays have the usual summary methods,
# such as .mean(), .min() and .max()

# Note: axis=None is the default, so the argument can be omitted.
print("Grand mean of arr_2d:")
print(arr_2d.mean(axis=None))
#
print("\nColumn means of arr_2d:")
print(arr_2d.mean(axis=0))
#
print("\nRow means of arr_2d:")
print(arr_2d.mean(axis=1))

Grand mean of arr_2d:
3.866666666666667

Column means of arr_2d:
[3.  3.8 4.8]

Row means of arr_2d:
[2.26666667 5.46666667]


In [17]:
# Special attention is needed regarding the variance/stdev
# They are computed with the .var() and .std() methods:
print("Variance of variables in arr_2d:")
print(arr_2d.var(axis=0))

# This is different from the usual sample variance.
# The reason for this difference is that, by default (ddof=0), NumPy computes
# the maximum likelihood estimate (MLE) of the variance.
# Therefore, for a sample of N observations, the sum of squared deviations
# is divided by N rather than (N - 1).

Variance of variables in arr_2d:
[3.24 1.69 2.89]
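
The divisor is controlled by the ddof ("delta degrees of freedom") argument of .var() and .std(). Here is a minimal sketch, using a self-contained copy of the arr_2d values under the illustrative name demo:

```python
import numpy as np

# Same values as arr_2d, rebuilt here so the example is self-contained
demo = np.array([[1.2, 2.5, 3.1],
                 [4.8, 5.1, 6.5]])

# Default: ddof=0, the sum of squared deviations is divided by N
var_n = demo.var(axis=0, ddof=0)

# ddof=1: the sum of squared deviations is divided by (N - 1)
var_n_minus_1 = demo.var(axis=0, ddof=1)

# The same argument exists for the standard deviation
std_n_minus_1 = demo.std(axis=0, ddof=1)

print("Variance with ddof=0:", var_n)
print("Variance with ddof=1:", var_n_minus_1)
print("Standard deviation with ddof=1:", std_n_minus_1)
```

With only two observations per column, dividing by (N - 1) = 1 instead of N = 2 doubles the variance, which makes the effect of ddof easy to see.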


In [18]:
# The same distinction appears in the np.cov() function:

# The default is rowvar=True, which treats each row as a variable and each column as an observation.
# If your variables are in the columns, always use rowvar=False.

# With bias=True, the sums of squares and cross-products are divided by N
print("\nResult of np.cov() with rowvar=False and bias=True:")
print(np.cov(arr_2d, rowvar=False, bias=True))

# Thankfully for most analyses, bias=False (division by N - 1) is the default
print("\nResult of np.cov() with rowvar=False and bias=False:")
print(np.cov(arr_2d, rowvar=False, bias=False))


Result of np.cov() with rowvar=False and bias=True:
[[3.24 2.34 3.06]
 [2.34 1.69 2.21]
 [3.06 2.21 2.89]]

Result of np.cov() with rowvar=False and bias=False:
[[6.48 4.68 6.12]
 [4.68 3.38 4.42]
 [6.12 4.42 5.78]]


# Reading data

In [19]:
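# np.genfromtxt() reads delimited text files, including .csv files and not only .txt files
# Columns that contain text cannot be converted to numbers and are read as nan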
mite_env = np.genfromtxt(fname="../data/mite_env.csv", skip_header=1, delimiter=",")

# Print the first five lines
print("The first five lines of mite_env:")
print(mite_env[0:5,])

The first five lines of mite_env:
[[ 39.18 350.15    nan    nan    nan]
 [ 54.99 434.81    nan    nan    nan]
 [ 46.07 371.72    nan    nan    nan]
 [ 48.19 360.5     nan    nan    nan]
 [ 23.55 204.13    nan    nan    nan]]


In [20]:
print("Dimensions of mite_env:", mite_env.shape)
print("Data type of mite_env:", mite_env.dtype)

Dimensions of mite_env: (70, 5)
Data type of mite_env: float64


In [22]:
# Numpy is not well-suited for qualitative variables
# Extract only the quantitative variables
mite_env_quant = mite_env[:, 0:2]

# Print the first five lines
print("The first five lines of mite_env_quant:")
print(mite_env_quant[0:5,])

The first five lines of mite_env_quant:
[[ 39.18 350.15]
 [ 54.99 434.81]
 [ 46.07 371.72]
 [ 48.19 360.5 ]
 [ 23.55 204.13]]
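
The nan columns above come from text that cannot be converted to a number. The sketch below reproduces this behaviour on a tiny in-memory CSV and shows one possible alternative: asking np.genfromtxt() to guess one data type per column. The CSV content and its column names (density, water, substrate) are made up for this example and are not the real mite_env.csv file.

```python
from io import StringIO

import numpy as np

# Hypothetical CSV content, made up for this sketch
csv_text = (
    "density,water,substrate\n"
    "39.18,350.15,Sphagnum\n"
    "54.99,434.81,Litter\n"
)

# Default behaviour: every column is parsed as float, so text becomes nan
as_float = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1)

# names=True reads the header line, dtype=None guesses a type per column
as_struct = np.genfromtxt(StringIO(csv_text), delimiter=",", names=True,
                          dtype=None, encoding="utf-8")

print("Parsed as float:\n", as_float)
print("\nColumn names:", as_struct.dtype.names)
print("Text column:", as_struct["substrate"])
```

The second call returns a structured array, which is accessed by column name rather than by position. For data that mix numbers and text, a dedicated table library is often more convenient.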


# Mini-matrix primer

In [23]:
# Create a small matrix
mat1 = np.array([[2, 4],
                 [1, 6],
                 [5, 3]])

print("Values in mat1:\n", mat1)
print("\nShape of mat1:", mat1.shape)

Values in mat1:
 [[2 4]
 [1 6]
 [5 3]]

Shape of mat1: (3, 2)


In [None]:
# The transpose of a matrix is the same matrix with its rows and columns swapped.
# It is available as the attribute .T

print("Values in mat1.T:\n", mat1.T)
print("\nShape of mat1.T:", mat1.T.shape)

Values in mat1.T:
 [[2 1 5]
 [4 6 3]]

Shape of mat1.T: (2, 3)


In [None]:
# A vector can be stored as a matrix in which one dimension has length 1
# It can be a 1xp row vector or a px1 column vector

# When entering vector values into NumPy, mind the [[]] notation
# This is different from [] in that you get a 2-dimensional array, but with just one column or row.

# With a single pair of brackets [], NumPy creates a 1-dimensional array of shape (p,),
# which is neither a row vector nor a column vector

vec1 = np.array(object=[[3], [2]])

print("Values in vec1:\n", vec1)
print("\nShape of vec1:", vec1.shape)

Values in vec1:
 [[3]
 [2]]

Shape of vec1: (2, 1)
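
The difference between a 1-dimensional array and explicit row or column vectors can be checked through their shapes. This is a small sketch with illustrative names (v_1d, v_col, v_row):

```python
import numpy as np

# A single pair of brackets gives a 1-dimensional array
v_1d = np.array([3, 2])

# reshape() turns it into an explicit column or row vector;
# -1 lets NumPy work out that dimension from the number of values
v_col = v_1d.reshape(-1, 1)
v_row = v_1d.reshape(1, -1)

print("Shape of v_1d:", v_1d.shape)
print("Shape of v_col:", v_col.shape)
print("Shape of v_row:", v_row.shape)

# Transposing a 1-dimensional array changes nothing,
# whereas transposing a column vector gives a row vector
print("Shape of v_1d.T:", v_1d.T.shape)
print("Shape of v_col.T:", v_col.T.shape)
```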


In [27]:
# If the number of columns in the first matrix matches the number of rows in the second matrix, the product can be computed.

print("Shape of mat1:", mat1.shape)
print("\nShape of vec1:", vec1.shape)

Shape of mat1: (3, 2)

Shape of vec1: (2, 1)


In [28]:
print("Values in mat1:\n", mat1)
print("Values in vec1:\n", vec1)

Values in mat1:
 [[2 4]
 [1 6]
 [5 3]]
Values in vec1:
 [[3]
 [2]]


In [29]:
# The matrix product in Python can be done
# as a method or with an operator

print("Matrix product as a method:")
print(mat1.dot(b=vec1))
#
print("\nMatrix product with an operator:")
print(mat1 @ vec1)

Matrix product as a method:
[[14]
 [15]
 [21]]

Matrix product with an operator:
[[14]
 [15]
 [21]]
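
The order of the operands matters for the matrix product. The sketch below uses the same values as mat1 and vec1 under the illustrative names A and x, and shows what happens when the inner dimensions do not match.

```python
import numpy as np

A = np.array([[2, 4],
              [1, 6],
              [5, 3]])      # shape (3, 2)
x = np.array([[3],
              [2]])         # shape (2, 1)

# (3, 2) @ (2, 1): the inner dimensions match (2 and 2), result is (3, 1)
product = A @ x
print("Shape of A @ x:", product.shape)

# (2, 1) @ (3, 2): the inner dimensions differ (1 and 3), so NumPy raises an error
try:
    x @ A
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("x @ A failed:", err)
```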


In [31]:
# Variance-covariance matrices can be obtained very efficiently using matrix algebra.

# Get the number of observations
n = mat1.shape[0]
# Center the data
mat1_c = (mat1 - mat1.mean(axis=0))
# Compute the variance-covariance matrix
S = 1.0 / (n - 1.0) * mat1_c.T @ mat1_c

print("Variance-covariance matrix:")
print(S)

Variance-covariance matrix:
[[ 4.33333333 -2.83333333]
 [-2.83333333  2.33333333]]


In [34]:
# Just to make sure, you can compare it with the result given by np.cov().
#
print(np.cov(mat1, rowvar=False, bias=False))

[[ 4.33333333 -2.83333333]
 [-2.83333333  2.33333333]]


In [35]:
# Special arrays can be built for linear algebra

# Zeros
zeros = np.zeros((2, 2))
print("A 2x2 matrix of 0's:")
print(zeros)

# Ones
ones = np.ones((4, 1))
print("\nA 4x1 matrix of 1's:")
print(ones)

# Identity matrix
print("\nA 3x3 identity matrix:")
print(np.eye(N=3))

A 2x2 matrix of 0's:
[[0. 0.]
 [0. 0.]]

A 4x1 matrix of 1's:
[[1.]
 [1.]
 [1.]
 [1.]]

A 3x3 identity matrix:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [None]:
# By supplying matrices to np.hstack() (horizontal stack),
# you can concatenate them side by side, i.e. by columns.
# This is useful, for example, to add the intercept column in linear regression.

# Get the number of observations and generate a column
# matrix of 1's 
n = mite_env_quant.shape[0]
ones_n = np.ones((n, 1))

# Concatenate a column matrix of 1's
mite_env_con1 = np.hstack(tup=[ones_n, mite_env_quant])

# Show the first five values
print("The first five values of mite_env_con1:")
print(mite_env_con1[0:5, ])

# There is also np.vstack() (vertical stack) for concatenating matrices by rows,
# for example to append new observations.

# If you know R, np.hstack() and np.vstack() are similar to cbind() and rbind(), respectively.

The first five values of mite_env_con1:
[[  1.    39.18 350.15]
 [  1.    54.99 434.81]
 [  1.    46.07 371.72]
 [  1.    48.19 360.5 ]
 [  1.    23.55 204.13]]
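
As a sketch of the linear regression use mentioned above, the block below stacks a column of 1's next to a predictor and estimates the coefficients with np.linalg.lstsq(). The data are made up so that they follow y = 1 + 2x exactly; they are not taken from the mite dataset.

```python
import numpy as np

# Made-up data that follow y = 1 + 2x exactly
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * x

# Design matrix: a column of 1's (intercept) next to the predictor
X = np.hstack([np.ones((x.shape[0], 1)), x])

# Least-squares solution; rcond=None selects NumPy's current default cutoff
coef, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

print("Design matrix:\n", X)
print("\nEstimated intercept and slope:")
print(coef)

# np.vstack() stacks by rows instead: here, the observations are duplicated
x_twice = np.vstack([x, x])
print("\nShape of x_twice:", x_twice.shape)
```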


# Linear algebra with np.linalg


In [None]:
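# Some useful functions of the np.linalg module:
#np.linalg.cholesky()  # Cholesky decomposition
#np.linalg.eig()       # Eigenvalues and eigenvectors
#np.linalg.qr()        # QR decomposition
#np.linalg.svd()       # Singular value decomposition
#np.linalg.inv()       # Matrix inverse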

In [None]:
# Create a square matrix that could be a covariance matrix between two variables
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

In [38]:
# Compute its determinant
print("The determinant of S:")
print(np.linalg.det(a=S))
print(type(np.linalg.det(a=S)))

The determinant of S:
0.3599999999999999
<class 'numpy.float64'>


In [39]:
# Get the inverse of the S matrix
Sm1 = np.linalg.inv(a=S)

print("The inverse of S:")
print(Sm1)
print(type(Sm1))

The inverse of S:
[[ 2.77777778 -2.22222222]
 [-2.22222222  2.77777778]]
<class 'numpy.ndarray'>


In [40]:
print("\nThe result of Sm1 x S:")
print(Sm1 @ S)
print(type(Sm1 @ S))

print("\nThe result of S x Sm1:")
print(S @ Sm1)
print(type(S @ Sm1))


The result of Sm1 x S:
[[1.00000000e+00 0.00000000e+00]
 [2.12175956e-16 1.00000000e+00]]
<class 'numpy.ndarray'>

The result of S x Sm1:
[[1.00000000e+00 2.12175956e-16]
 [0.00000000e+00 1.00000000e+00]]
<class 'numpy.ndarray'>


In [51]:
# Create a square matrix S2 that is singular:
# its second row is twice the first, so its determinant is 0
S2 = np.array([[1.0, 0.8],
               [2.0, 1.6]])
#
np.linalg.det(S2)

np.float64(0.0)

In [52]:
# Try to invert matrix S2, which is singular
# This raises an error because the inverse does not exist
#
np.linalg.inv(S2)

LinAlgError: Singular matrix
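
In a script, you may prefer to detect this situation rather than let the error stop the program. One possible approach, sketched below, is to check the rank of the matrix and to catch np.linalg.LinAlgError. The variable names are illustrative.

```python
import numpy as np

# Same singular matrix as above: the second row is twice the first
S2 = np.array([[1.0, 0.8],
               [2.0, 1.6]])

# A 2x2 matrix is invertible only if its rank is 2
rank_S2 = np.linalg.matrix_rank(S2)
print("Rank of S2:", rank_S2)

# Catch the error instead of letting it interrupt the program
try:
    S2_inv = np.linalg.inv(S2)
except np.linalg.LinAlgError as err:
    S2_inv = None
    print("Could not invert S2:", err)
```

Note that a matrix that is only nearly singular may not raise any error, so checking the rank (or the condition number with np.linalg.cond()) is a useful complement.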

# The return of list unpacking

In [53]:
# Perform eigenanalysis of S
print(np.linalg.eig(a=S))

EigResult(eigenvalues=array([1.8, 0.2]), eigenvectors=array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))


In [54]:
# Since two objects are returned, it is more convenient to unpack them:
out_values, out_vectors = np.linalg.eig(a=S)

# Look at them!
print("\nThe eigenvalues:")
print(out_values)

print("\nThe eigenvectors:")
print(out_vectors)


The eigenvalues:
[1.8 0.2]

The eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]
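
To check how the two outputs relate, recall that each column of the eigenvector matrix pairs with the eigenvalue at the same position. The sketch below verifies the defining relation S u = lambda u for each pair, using the same S matrix rebuilt locally.

```python
import numpy as np

# Same S matrix as above, rebuilt so the example is self-contained
S = np.array([[1.0, 0.8],
              [0.8, 1.0]])

values, vectors = np.linalg.eig(S)

# Column k of `vectors` is the eigenvector associated with values[k]
checks = []
for k in range(values.size):
    u_k = vectors[:, k]
    checks.append(np.allclose(S @ u_k, values[k] * u_k))
print("S u = lambda u holds for each pair:", checks)

# Because S is symmetric, it can be rebuilt from its eigen-decomposition
S_rebuilt = vectors @ np.diag(values) @ vectors.T
print("S rebuilt from its eigenvalues and eigenvectors:")
print(S_rebuilt)
```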


In [55]:
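# You can now write a small function for PCA
def pca(X):
    # Number of observations (rows)
    n = X.shape[0]
    # Center each variable (column) on its mean
    Xc = X - X.mean(axis=0)
    # Variance-covariance matrix of the variables
    S = 1.0 / (n - 1.0) * Xc.T @ Xc
    # Eigenvalues and eigenvectors (np.linalg.eig does not sort them)
    lam, U = np.linalg.eig(a=S)
    # Coordinates of the observations on the principal axes
    F = Xc @ U
    return [lam, U, F]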
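
A possible variant of this function is sketched below. It is not the version used in this notebook: it standardizes the variables, uses np.linalg.eigh() (intended for symmetric matrices) and sorts the axes by decreasing eigenvalue. The function name pca_standardized and the random data are assumptions made for illustration.

```python
import numpy as np

def pca_standardized(X):
    """Hypothetical variant of pca(): standardized variables, sorted axes."""
    n = X.shape[0]
    # Standardize: center each column and divide by its sample standard deviation
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the variables
    R = Z.T @ Z / (n - 1)
    # eigh() is meant for symmetric matrices and returns ascending eigenvalues
    lam, U = np.linalg.eigh(R)
    # Reorder so that the first axis has the largest eigenvalue
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # Coordinates of the observations on the principal axes
    F = Z @ U
    return lam, U, F

# Made-up data: three variables on very different scales
rng = np.random.default_rng(seed=42)
X_demo = rng.normal(size=(30, 3)) * np.array([1.0, 10.0, 100.0])

lam, U, F = pca_standardized(X_demo)
print("Eigenvalues (decreasing):", lam)
print("Proportion of variance per axis:", lam / lam.sum())
```

With standardized variables, the eigenvalues sum to the number of variables, so dividing by their sum gives the proportion of variance carried by each axis.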

In [57]:
# Apply it on the mite_env_quant dataset
pca(X=mite_env_quant)

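# Note: In practice, you would standardize the variables first, since they are on very different scales; this is for explanatory purposes.
# Also note that the eigenvalues are not returned in decreasing order here.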

[array([  124.71443868, 20285.35214136]),
 array([[-0.999555  , -0.02982947],
        [ 0.02982947, -0.999555  ]]),
 array([[-1.70115975e+00,  6.04618750e+01],
        [-1.49787615e+01, -2.46320554e+01],
        [-7.94467208e+00,  3.86959485e+01],
        [-1.03984153e+01,  4.98477172e+01],
        [ 9.56618594e+00,  2.06883131e+02],
        [-2.09845050e+01,  9.85035915e+01],
        [ 1.38634001e+00,  3.17612017e+01],
        [-4.55796152e+01,  1.42559537e+02],
        [-2.51180311e+01,  9.92306142e+01],
        [ 1.47517765e+00,  1.90034283e+02],
        [-4.55651909e+00,  2.76492834e+02],
        [-7.65447770e+00,  4.49938750e+00],
        [ 6.32850490e+00,  1.67198893e+02],
        [-3.07235100e+00,  1.71110211e+02],
        [-2.24273096e+01,  5.93531316e+01],
        [ 1.22358524e+00,  8.88417475e+01],
        [ 6.32763164e+00,  1.13925161e+02],
        [ 9.43417385e+00,  1.34536999e+02],
        [-5.61431386e+00,  2.66501015e+01],
        [-7.23064483e+00,  2.64857889e+02],
    

# Masked matrices

In [58]:
# Generate a 5x5 matrix with values drawn at random from 0, 1, 2, 3 and -999 (the missing-value code)
ex_array = np.random.choice(a=[0.0, 1.0, 2.0, 3.0, -999.0], size=(5, 5))

# See the values in my array
print("Array with some values as -999")
print(ex_array)

# Create a copy of the array and replace values that are equal to -999 with np.nan
ex_nan = ex_array.copy()
ex_nan[ex_nan == -999] = np.nan

# See the values in the new array
print("\nArray with -999 coded as np.nan:")
print(ex_nan)

Array with some values as -999
[[   2. -999.    2.    3.    3.]
 [   0. -999. -999.    0.    2.]
 [   3.    1.    1.    2.    3.]
 [   1. -999.    2.    2.    1.]
 [   0.    3.    3. -999.    0.]]

Array with -999 coded as np.nan:
[[ 2. nan  2.  3.  3.]
 [ 0. nan nan  0.  2.]
 [ 3.  1.  1.  2.  3.]
 [ 1. nan  2.  2.  1.]
 [ 0.  3.  3. nan  0.]]


In [59]:
# Determine a boolean mask defined by whether or not values are equal to -999
mymask = ex_array == -999

# See the values in the mask
print("Boolean mask:")
print(mymask)

# Create a masked array from this mask
print("\nMasked array:")
ex_mask = np.ma.masked_array(data=ex_array, mask=mymask)
print(ex_mask)

Boolean mask:
[[False  True False False False]
 [False  True  True False False]
 [False False False False False]
 [False  True False False False]
 [False False False  True False]]

Masked array:
[[2.0 -- 2.0 3.0 3.0]
 [0.0 -- -- 0.0 2.0]
 [3.0 1.0 1.0 2.0 3.0]
 [1.0 -- 2.0 2.0 1.0]
 [0.0 3.0 3.0 -- 0.0]]


In [60]:
# Print out the mean of these arrays
print("The mean of ex_array is:", ex_array.mean())
print("The mean of ex_nan is:", ex_nan.mean())
print("The nanmean of ex_nan is:", np.nanmean(ex_nan))
print("The mean of ex_mask is:", ex_mask.mean())

The mean of ex_array is: -198.44
The mean of ex_nan is: nan
The nanmean of ex_nan is: 1.7
The mean of ex_mask is: 1.7
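
Both approaches also work along an axis. The sketch below uses a small fixed array (made up for this example, so that the results are reproducible) and np.ma.masked_equal(), which builds the mask directly from the missing-value code.

```python
import numpy as np

# Small fixed array with -999 as the missing-value code
raw = np.array([[2.0, -999.0, 4.0],
                [0.0, 1.0, -999.0]])

# Masked array: mask every value equal to -999
masked = np.ma.masked_equal(raw, -999.0)
col_means_masked = masked.mean(axis=0)

# nan approach: replace -999 with np.nan, then use the nan-aware function
with_nan = np.where(raw == -999.0, np.nan, raw)
col_means_nan = np.nanmean(with_nan, axis=0)

# Number of valid (unmasked) values per column
n_valid = masked.count(axis=0)

print("Column means (masked array):", col_means_masked)
print("Column means (np.nanmean):", col_means_nan)
print("Valid values per column:", n_valid)
```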


# Views and copies

In [61]:
# Create three vectors
vec_1 = np.array([1, 2, 3, 4, 5])
vec_2 = vec_1[2:]
vec_3 = vec_1[2:].copy()

# Look at them!
print(vec_1)
print(vec_2)
print(vec_3)

[1 2 3 4 5]
[3 4 5]
[3 4 5]


In [62]:
print("ID of vec_1:", id(vec_1))
print("ID of vec_2:", id(vec_2))
print("ID of vec_3:", id(vec_3))

ID of vec_1: 2683190879472
ID of vec_2: 2683191835984
ID of vec_3: 2683192279376


In [63]:
print("Does vec_1 share memory with vec_2?")
print(np.shares_memory(vec_1, vec_2))

print("\nDoes vec_1 share memory with vec_3?")
print(np.shares_memory(vec_1, vec_3))

print("\nDoes vec_2 share memory with vec_3?")
print(np.shares_memory(vec_2, vec_3))

Does vec_1 share memory with vec_2?
True

Does vec_1 share memory with vec_3?
False

Does vec_2 share memory with vec_3?
False


In [64]:
# Change a value in vec_2
vec_2[1] = 9999

# Look at them!
print("Values of vec_1:")
print(vec_1)

print("\nValues of vec_2:")
print(vec_2)

print("\nValues of vec_3:")
print(vec_3)

Values of vec_1:
[   1    2    3 9999    5]

Values of vec_2:
[   3 9999    5]

Values of vec_3:
[3 4 5]


In [None]:
# So ask yourself when subsetting with NumPy:

# Will I do some analyses on this part and then go back to the original data?

# - If YES: Consider a .copy() of the data so you don't alter it unintentionally.
# - If NO: You can stick with a view, it is more memory-efficient since no data are duplicated.
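
Not every subsetting operation returns a view. As a minimal sketch (with illustrative names), basic slicing gives a view, whereas selecting with a list of indices gives a copy:

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

# Basic slicing returns a view on the original data
sliced = original[2:]

# Selecting with a list of indices returns a copy
picked = original[[2, 3, 4]]

print("sliced shares memory with original:", np.shares_memory(original, sliced))
print("picked shares memory with original:", np.shares_memory(original, picked))

# The .base attribute of a view points to the array that owns the data
print("sliced.base is original:", sliced.base is original)

# Modifying the copy leaves the original untouched
picked[0] = -1
print("original after modifying picked:", original)
```

When in doubt, np.shares_memory() is a quick way to check which situation you are in.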