<a href="https://colab.research.google.com/github/cu-applied-math/appm-4600-numerics/blob/main/Labs/Lab12_LeastSquares.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 12: Least Squares

This lab focuses on the problem
$$\min_{\vec{x}\in\mathbb{R}^n}\;\|A\vec{x}-\vec{b}\|_2$$
where $A\in\mathbb{R}^{m\times n}$ with $m\ge n$ and $\text{rank}(A)=n$.

We'll construct problems where we can check our answer: first create a "true" $\vec{x}_{\text{true}}$, then set $\vec{b}=A \vec{x}_{\text{true}}$ for a given matrix $A$. Your job is to solve for
$\vec{x}_{\text{estimate}}=\text{argmin}_{\vec{x}}\|A\vec{x}-\vec{b}\|_2$.
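
A short worked step shows why this construction lets us check the answer. It uses only the assumptions already stated, namely $\vec{b}=A\vec{x}_{\text{true}}$ and $\text{rank}(A)=n$:
$$\|A\vec{x}-\vec{b}\|_2 = \|A(\vec{x}-\vec{x}_{\text{true}})\|_2 \ge 0, \quad\text{with equality iff}\quad A(\vec{x}-\vec{x}_{\text{true}})=\vec{0} \iff \vec{x}=\vec{x}_{\text{true}}.$$
The last equivalence holds because a matrix with full column rank has a trivial null space. So in exact arithmetic the minimizer is unique and equals $\vec{x}_{\text{true}}$, and any error you observe comes from floating-point rounding (including the rounding in forming $\vec{b}$ itself).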

#### Learning objectives
- Learn how to solve a least squares problem "from scratch" (well, partially from scratch)
- Understand pros and cons of different methods
  - Which ones require pivoting?
  - Which ones are faster?
  - Which ones are faster on multiple cores or a GPU?  (i.e., more parallelizable)
  - Which ones are more accurate?
  - Does this depend on conditioning of $A$? On its shape? (i.e., almost square vs very tall and skinny)

#### Tasks
1. Write at least **four** distinct implementations to solve the least-squares problem (the solutions have about implementations). The implementations can be similar to each other, but each variant should be non-trivial.  You **are** allowed to use linear algebra library functions, such as:
    - `scipy.linalg.inv`
    - `scipy.linalg.solve` and all its variants (i.e., the different options for `assume_a`)
    - `scipy.linalg.qr` and all its variants (i.e., pivoting or not)
    - `scipy.linalg.solve_triangular` for forward/back substitution
    - `scipy.linalg.qr_multiply`
    - `scipy.linalg.svd`
    - `scipy.linalg.lstsq` and all its variants (i.e., the different options for `lapack_driver`). Since this routine does everything for you, you may only count it as **one** of your implementations (that is, if you do different options for `lapack_driver`, those don't count as separate implementations).
2. Use your implementations
    - Try "problem 0", a small easy problem that is good for checking whether your code works
    - Then try "problem 1" and "problem 2" which are larger and less well-conditioned. How well do the implementations work?  
        - **Time** your code
        - Also **record the error** in the $\ell_\infty$ norm, i.e., $\|\vec{x}_{\text{true}} - \vec{x}_{\text{estimate}}\|_\infty$
3. If you have time, try an implementation on the GPU. *We highly recommend doing this part of the lab on Google Colab.*
    - Use PyTorch (already installed on Colab)
    - Select a GPU runtime
    - See the scaffolded code below
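
As a reference point for the tasks above, here is a minimal sketch of the one "does everything for you" baseline, `scipy.linalg.lstsq`, together with the $\ell_\infty$ error check. The helper names `solve_lstsq_baseline` and `linf_error`, the seed, and the choice of `lapack_driver="gelsy"` are illustrative assumptions, not part of the lab's solutions.

```python
import numpy as np
import scipy.linalg as sla

def solve_lstsq_baseline(A, b, lapack_driver="gelsy"):
    """Baseline solver (hypothetical helper): scipy.linalg.lstsq does all the work."""
    # lstsq returns (solution, residues, rank, singular values); we keep only x
    x, _residues, _rank, _sing_vals = sla.lstsq(A, b, lapack_driver=lapack_driver)
    return x

def linf_error(x_est, x_true):
    """Error in the infinity norm (hypothetical helper)."""
    return np.linalg.norm(x_est - x_true, np.inf)

# Small, well-conditioned test problem in the spirit of "Problem 0"
rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
m, n = 40, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x_est = solve_lstsq_baseline(A, b)
err = linf_error(x_est, x_true)
print(f"Error is {err:.2e}")
```

Your own implementations can be checked against this baseline in the same way: same `A` and `b`, same error measure.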

#### Deliverables
1. Code for at least 4 implementations, and...
2. at least one sentence about some pros and cons of the different methods.

Put both of these deliverables into the same PDF and upload it to Canvas.

*APPM 4600. Copyright 2025 Department of Applied Mathematics, University of Colorado Boulder. Released under a BSD 3-clause license*

In [13]:
import numpy as np
import scipy.linalg as sla

## Part 1: write at least **four** implementations to solve the least-squares problem



In [15]:
# Problem 0: use this for debugging
problem_label = 'Problem 0'
rng     = np.random.default_rng(12345)
n       = 20
m       = 40
A       = rng.standard_normal( (m,n), dtype=np.double )
xTrue   = rng.standard_normal(n)
b       = A@xTrue
print(f'Matrix is size {m} x {n}')
print(f'Condition number of A is {np.linalg.cond(A):.2e}')

Matrix is size 40 x 20
Condition number of A is 3.78e+00


In [None]:
# Your solutions

x = ... TODO ...

err   = np.linalg.norm(x-xTrue,np.inf)
print(f'Error is {err:.2e}')

## Part 2: Try solving some larger problems
- Record the **error** in the infinity norm, and...
- Record the **time** it takes

To measure time in Jupyter/Colab/IPython, you can use a line like

`%time x = myFunction(...)`

and it will time how long it takes `myFunction(...)` to run
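
If you run your code outside of a notebook, the `%time` magic is not available. A minimal sketch of a stand-in, using `time.perf_counter` from the standard library, might look like the following; `timed` is a hypothetical helper name, and `sum` is used only as a placeholder for your own solver.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Placeholder workload; replace `sum` with your own least-squares function
total, seconds = timed(sum, range(1_000_000))
print(f"result = {total}, wall time = {seconds:.3e} s")
```

Note that this measures wall-clock time only, and a single run can be noisy, so it is worth repeating the measurement a few times.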


What **observations** can you make?

Note that `%time` reports both:
1. CPU time, which is the time spent computing, summed over all cores. So if you have 4 cores and each one is fully utilized for 3 seconds, the CPU time is 12 seconds.
2. "Wall" time, which refers to "wall clock" time. This is the actual time elapsed from when you start to when you stop.

If the CPU time is 12 seconds and the wall time is 3 seconds, this indicates that the code ran in parallel.
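
The difference between the two clocks can be illustrated with the standard library alone. This is a sketch under the assumption that `time.process_time` (CPU time of the current process) and `time.perf_counter` (wall clock) are reasonable stand-ins for what `%time` reports; the sleep and the loop are arbitrary placeholder workloads.

```python
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time used by this process

time.sleep(0.2)                              # waiting: wall clock advances, CPU is mostly idle
busy = sum(i * i for i in range(200_000))    # computing: both clocks advance

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print(f"CPU time = {cpu:.3f} s, wall time = {wall:.3f} s")
```

Here the wall time exceeds the CPU time because the process spent most of its time waiting. A multithreaded linear algebra routine shows the opposite pattern: CPU time larger than wall time.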

In [17]:
# Problem 1
problem_label = 'Problem 1'
rng     = np.random.default_rng(12345)
n       = int(5e3)
m       = n+10
# To give ourselves an added challenge (and to be fair for when we compare to the GPU),
#   let's do this in single precision rather than double precision, so this means
#   we start with about 7 digits of accuracy
A       = rng.standard_normal( (m,n), dtype=np.single )
xTrue   = rng.standard_normal(n)
b       = A@xTrue
print(f'Matrix is size {m} x {n}')
print(f'Condition number of A is {np.linalg.cond(A):.2e}') # takes a bit of time to run! It's about 2e3 if n=5e3

Matrix is size 5010 x 5000
Condition number of A is 1.97e+03


In [18]:
# Problem 2
problem_label = 'Problem 2'
rng = np.random.default_rng(12345)

n   = int(2e3)
n2  = int(n/2)
m   = int(1e4)
A       = rng.standard_normal( (m,n2) )
c       = 1e-2 # make this smaller, like 1e-4, to make the matrix more ill-conditioned
A       = np.hstack(  (A,A+c*rng.standard_normal( (m,n2) )), dtype=np.single )
xTrue   = rng.standard_normal(n)
b       = A@xTrue
print(f'Matrix is size {m} x {n}')
print(f'Condition number of A is {np.linalg.cond(A):.2e}') # takes a bit of time to run! It's about 4.1e2

Matrix is size 10000 x 2000
Condition number of A is 4.14e+02


## Part 3: Try it on the GPU

Restart the Colab runtime and select a GPU runtime (or try a TPU!).  If you register for the Colab "pro" account (free for students and educators!) you'll get access to more/faster GPUs.

We'll use PyTorch for our GPU computing.  You *can* run this just on the CPU also.  Some code is specific for either CPU or GPU, but most code runs on both.

PyTorch has its own `torch.linalg` library, which is similar to (but not the same as) `numpy.linalg` and `scipy.linalg`.  It has GPU implementations, but lacks some of the fancier features that scipy has, and overall has fewer methods.

For timing, you can still use `%time` for a single line, or `%%time` to time an entire cell. However, you need to make sure to call `torch.cuda.synchronize()` within the timed code. GPU operations are launched asynchronously, so without the synchronization the timer can stop before the computation has actually finished.
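
A minimal sketch of this timing pattern, written so that it also runs on the CPU, is shown here. The helper name `timed_on_device` is hypothetical, and a plain matrix product stands in for whatever solver you want to time.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def timed_on_device(func, *args):
    """Call func and return (result, elapsed wall-clock seconds)."""
    if device.type == "cuda":
        torch.cuda.synchronize()   # make sure earlier queued GPU work is done
    start = time.perf_counter()
    result = func(*args)
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait until the GPU has actually finished
    return result, time.perf_counter() - start

# Placeholder workload: a matrix product instead of a least-squares solve
A = torch.randn(200, 100, device=device)
B = torch.randn(100, 50, device=device)
C, seconds = timed_on_device(torch.matmul, A, B)
print(f"Result shape {tuple(C.shape)}, device {device}, wall time {seconds:.3e} s")
```

The first GPU call often includes one-time startup overhead, so consider running the function once before timing it.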

In [1]:
import torch
import torch.linalg as tsla
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print('Device is ', device)

Device is  cpu


In [None]:
# Problem 0: use this for debugging
problem_label = 'Problem 0'

torch.manual_seed(1234)
n       = 20
m       = 40
A       = torch.randn(m,n,device=device) # by default it's single precision
xTrue   = torch.randn(n,1,device=device)
b       = A@xTrue
print(f'Matrix is size {m} x {n}, device is {device}')
print(f'Condition number of A is {tsla.cond(A):.2e}') # small (order 1 to 10) for this easy problem

In [None]:
# TODO: try a GPU implementation

# (note: you do NOT have to do a GPU implementation in order to get full credit for the lab - it's not one of the deliverables)

In [10]:
problem_label = 'Problem 1'
torch.manual_seed(1234)
n       = int(5e3)
m       = n+10
A       = torch.randn(m,n,device=device) # by default it's single precision
xTrue   = torch.randn(n,1,device=device)
b       = A@xTrue
print(f'Matrix is size {m} x {n}, device is {device}')
print(f'Condition number of A is {tsla.cond(A):.2e}') # 1e3

Matrix is size 5010 x 5000, device is cpu
Condition number of A is 1.73e+03


In [8]:
problem_label = 'Problem 2'
torch.manual_seed(0)

n   = int(2e3)
n2  = int(n/2)
m   = int(1e4)
A       = torch.randn(m,n2,device=device) # by default it's single precision
A       = torch.hstack(  (A,A+1e-2*torch.randn(m,n2,device=device) ) )
xTrue   = torch.randn(n,1,device=device)
b       = A@xTrue
print(f'Matrix is size {m} x {n}')
print(f'Condition number of A is {tsla.cond(A):.2e}') # 4e2

Matrix is size 10000 x 2000
Condition number of A is 4.14e+02


In [None]:
# TODO: do some tests on these larger/harder problems