CSU DSCI 369 Lab 9
Instructor: Emily J. King
Spring 2024

Goals:  Learn commands to compute norm, cosine similarity, correlation coefficient, and dimension of span.


In [1]:
import numpy as np
from numpy.linalg import norm
from numpy.linalg import matrix_rank as rank

Play with some random vectors.

Create two random vectors x and y in R^7.

In [2]:
x=np.random.rand(7)
y=np.random.rand(7)

Compute the norm of each vector.

In [3]:
norm(x)

1.6497949063146993

In [4]:
norm(y)

1.345827364348591

Normalize each vector and call the normalized vectors nx and ny, respectively.


In [5]:
nx=x/norm(x)
ny=y/norm(y)

Double check the norms of the normalized vectors.  What should they be?


In [6]:
norm(nx)

0.9999999999999999

In [7]:
norm(ny)

0.9999999999999999

Compute the cosine similarity of x and y.  This is just the inner product of nx and ny.  There are built-in functions that compute the cosine similarity of x and y directly, but they require installing additional packages.  If you are comfortable enough with Python, you may write your own function to compute the cosine similarity of x and y and submit it along with your Jupyter notebook.

In [8]:
np.dot(nx,ny)

0.7407010742047315
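
For those who want to write their own function, here is a minimal sketch. The name cosine_similarity and the zero-vector check are illustrative assumptions, not part of the lab; the function simply divides the inner product by the product of the norms, which is the same as the inner product of the normalized vectors.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (hypothetical helper, not from the lab)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = norm(a) * norm(b)
    if denom == 0:
        # The angle is undefined if either vector is the zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return np.dot(a, b) / denom
```

Calling cosine_similarity(x, y) should agree with np.dot(nx, ny) up to floating-point rounding.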

Compare your number with the others in the lab. Why do these numbers make sense? Discuss.

Now let's compute the correlation coefficient.  In numpy, we have a function that computes a matrix of the correlation coefficients of a collection of arrays of the same size, i.e.:

In [9]:
np.corrcoef(x, y)

array([[ 1.        , -0.18759866],
       [-0.18759866,  1.        ]])

The (1,1) entry is the correlation coefficient of x with itself, which is 1 since x perfectly linearly predicts x.  Similarly, the (2,2) entry is the correlation coefficient of y with itself.  The (1,2) and (2,1) entries are the correlation coefficient of x and y (and y and x, which is the same thing). The corrcoef command can also compute the correlation coefficients of all pairs of vectors from a longer collection if the vectors are passed together as a single list, e.g., np.corrcoef([x1, x2, x3, x4, x5]).  Each vector is then treated as one row, and the result is a 5x5 matrix.
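
As a small sketch of the pairwise case, the snippet below passes three vectors to corrcoef as one list. The names x1, x2, x3 and the seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is reproducible
x1, x2, x3 = rng.random(7), rng.random(7), rng.random(7)

# Each vector in the list is treated as one row (one variable).
C = np.corrcoef([x1, x2, x3])

print(C.shape)  # (3, 3)
# With Python's 0-based indexing, C[0, 2] is the correlation coefficient of x1 and x3.
print(C[0, 2])
```

The matrix is symmetric with ones on the diagonal, and each off-diagonal entry matches the two-vector call, e.g. np.corrcoef(x1, x3)[0, 1].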

Compare your number with the others in the lab. Why do these numbers make sense? Discuss.

Now let's compute the correlation coefficient directly using the formula from class, which is the cosine similarity of the centered vectors.

In [10]:
np.dot((x-np.mean(x))/norm(x-np.mean(x)),((y-np.mean(y))/norm(y-np.mean(y))))


-0.18759866382725157

This should be the same as the (1,2) entry of the correlation matrix computed above, up to floating-point rounding in the last digits.
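
Rather than comparing digits by eye, one option is to check the agreement with np.isclose. This is only a sketch: it generates its own vectors with an arbitrary seed, and the names xc, yc, r_manual, and r_numpy are assumptions introduced for illustration.

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
x, y = rng.random(7), rng.random(7)

# Center each vector by subtracting its mean.
xc, yc = x - x.mean(), y - y.mean()

# Cosine similarity of the centered vectors.
r_manual = np.dot(xc / norm(xc), yc / norm(yc))

# Off-diagonal entry of the 2x2 correlation matrix (0-based index [0, 1]).
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_manual, r_numpy))  # True
```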

Finally, we compute the dimension of the span of the vectors x and y by computing the rank of the matrix with x and y as columns.

In [11]:
rank(np.column_stack((x,y)))

2

Why does this make sense?  Discuss.

Next, let's compute the cosine similarity, correlation coefficient, and dimension of the span of x and 2.3x instead of x and y:

In [12]:
np.dot(x/norm(x),2.3*x/norm(2.3*x))

0.9999999999999999

In [13]:
np.dot((x-np.mean(x))/norm(x-np.mean(x)),((2.3*x-np.mean(2.3*x))/norm(2.3*x-np.mean(2.3*x))))

1.0

In [14]:
rank(np.column_stack((x, 2.3*x)))

1

Why do these numbers make sense? Discuss.

And we now compute the dimension of the span of x, y, and 2.3 x.

In [16]:
rank(np.column_stack((x, y, 2.3*x)))

2

Why does this make sense?

Exercises

1. Make two random vectors u and v in R^100.

In [17]:
u=np.random.rand(100)
v=np.random.rand(100)

2. Make w=u+v

In [18]:
w = u + v

3. Make z be u with some Gaussian noise added: z = u + 0.1*np.random.normal(0,1,np.size(u))

In [19]:
z = u + (0.1*np.random.normal(0,1,np.size(u)))

4. Compute the cosine similarity of u and z. 


In [22]:
nu = u/norm(u)
nz = z/norm(z)
np.dot(nu,nz)

0.9880770083682999

5. Why does the answer to 4 make sense?


A cosine similarity of 0.988 makes sense because the cosine of a 0 degree angle is 1. The angle between u and z should be small because the two vectors are very similar (z is just u with a little bit of noise added). Because they are similar, we expect the angle between them to be close to 0 degrees and, by extension, the cosine of that angle to be close to 1.
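
To make the angle concrete, the cosine similarity can be converted to degrees with arccos; a value of 0.988 corresponds to roughly 9 degrees. The sketch below regenerates u and z with an arbitrary seed, so its numbers will differ from the output above. The names cos_sim and angle_deg are assumptions, and the clip call only guards against rounding that pushes the value slightly outside [-1, 1].

```python
import numpy as np
from numpy.linalg import norm

rng = np.random.default_rng(2)  # arbitrary seed; values differ from the lab output
u = rng.random(100)
z = u + 0.1 * rng.normal(0, 1, u.size)  # u plus small Gaussian noise

cos_sim = np.dot(u, z) / (norm(u) * norm(z))

# Convert the cosine to an angle in degrees.
angle_deg = np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

print(cos_sim, angle_deg)
```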

6. Compute the correlation coefficients of all of the pairs of vectors in u, v, w, z.


In [24]:
np.corrcoef([u,v,w,z])

array([[1.        , 0.00287091, 0.67306292, 0.95191125],
       [0.00287091, 1.        , 0.74151448, 0.02073011],
       [0.67306292, 0.74151448, 1.        , 0.65400672],
       [0.95191125, 0.02073011, 0.65400672, 1.        ]])

7. Discuss the correlation coefficients computed in 6.  In particular, why does the correlation coefficient of u and z make sense?


The diagonal of the calculated matrix makes sense because the correlation coefficient of a vector with itself is always 1 (because they're the same). The rest of these values signify how strongly correlated the vector pairs are. u and v are essentially uncorrelated (coefficient of about 0.003), which is what we expect from two independently generated random vectors. u and z have a coefficient of 0.95, which is a very strong positive correlation. This makes sense because u and z are very similar vectors (z is just u with noise added).


8. Compute the dimension of the span of u, v, w, z.


In [32]:
rank(np.column_stack((u, v, w, z)))

3

9. Why does this make sense?

A rank of 3 makes sense because u and v are linearly independent, so together they span a 2-dimensional subspace. Since w is u+v, it already lives in their span, so it has no effect on the overall rank. However, since z is u plus a random noise vector, it does not lie in the span of u and v, so it adds a third dimension to the span, making the rank 3.
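
One way to see this reasoning numerically is to add the vectors one at a time and watch the rank. This is a sketch with freshly generated vectors and an arbitrary seed; the names cols and ranks are assumptions introduced for illustration.

```python
import numpy as np
from numpy.linalg import matrix_rank

rng = np.random.default_rng(3)  # arbitrary seed for reproducibility
u, v = rng.random(100), rng.random(100)
w = u + v                                # lies in the span of u and v
z = u + 0.1 * rng.normal(0, 1, u.size)   # u plus noise, outside that span

cols, ranks = [], []
for name, vec in [("u", u), ("v", v), ("w", w), ("z", z)]:
    cols.append(vec)
    ranks.append(matrix_rank(np.column_stack(cols)))
    print(name, ranks[-1])
```

The rank goes from 1 to 2 when v is added, stays at 2 when w is added, and rises to 3 only when z is added.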