## Session 1 (Spring 2026)

### Review of the Syllabus & Canvas Site

- Textbooks
- Schedule of Topics
- Assessments
- Announcements
- Discussions
- Quizzes
- **Modules:** Organized by week with Jupyter notebooks, Python modules, additional notes, links to articles etc.



## Other Documentation/Software

Jupyter notebooks and JupyterLab: see the extensive [documentation](https://jupyter.readthedocs.io/en/latest/index.html).

To use Jupyter notebooks:

- [Markdown Basics](https://daringfireball.net/projects/markdown/basics)
- Look up Help from within the notebook interface!!

You can also use a reasonable text editor or IDE to edit Python modules:

- [PyCharm](https://www.jetbrains.com/pycharm/) (the free Community Edition or the Educational Edition),
- [Visual Studio Code](https://code.visualstudio.com/),
- **Spyder** (available as part of the Anaconda distro), or
- any old plaintext editor like `emacs` or `vi`.

## Linear Algebra (preliminary)

- Vectors, Vector Spaces
- Matrices
- Matrix products
- [Numpy](https://numpy.org/) and [Scipy](https://scipy.org/)


## Matrix Products

- Scalar multiplication
- Multiplication $Ax$
   * using the *rows* of $A$  (**inner/dot product**)
   * using the *columns* of $A$!  (**linear combination of columns**; **outer products** arise for $AB$)
- Column/Row Space of a matrix
- Independent rows and columns: Row rank and column rank of a matrix
- A vector space interpretation of matrix-vector and matrix-matrix multiplication
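The two views of $Ax$ listed above can be compared numerically. This is a minimal sketch with a made-up matrix `A` and vector `x` (both are illustrative assumptions, not course data): the row view takes one dot product per row, while the column view builds the same result as a combination of the columns of `A`.

```python
import numpy as np

A = np.array([[1, 2], [3, 4], [5, 6]])   # hypothetical 3 x 2 matrix
x = np.array([10, -1])                   # hypothetical vector with 2 entries

# Row view: each entry of Ax is the dot product of one row of A with x
by_rows = np.array([np.dot(row, x) for row in A])

# Column view: Ax is a combination of the columns of A weighted by x
by_cols = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(by_rows)   # [ 8 26 44]
print(by_cols)   # [ 8 26 44]
print(A @ x)     # NumPy's built-in product agrees with both views
```

The column view also shows that $Ax$ always lies in the column space of $A$.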
    

## NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transforms, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.

Check out the [User Guide](https://numpy.org/doc/stable/user/)!
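As a quick illustration of broadcasting, here is a small sketch with made-up arrays (the names `grid`, `row` and `col` are illustrative only). NumPy stretches the smaller operand along the missing or length-1 dimension instead of requiring an explicit loop.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])        # shape (3,): added to every row
col = np.array([[100], [200]])      # shape (2, 1): added to every column

print(grid + row)   # [[10 21 32], [13 24 35]]
print(grid + col)   # [[100 101 102], [203 204 205]]
```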

In [1]:
print("Hullo class!")

Hullo class!


In [2]:
import numpy as np

In [3]:
# declare a vector using a list as the argument
v = np.array([1,2,3,4])
v

array([1, 2, 3, 4])

In [4]:
# declare a matrix using a nested list as the argument
m = np.array([[1,2],[3,4]])
m

array([[1, 2],
       [3, 4]])

In [5]:
# still the same core type with different shapes
type(v), type(m)

(numpy.ndarray, numpy.ndarray)

In [6]:
m.shape

(2, 2)

In [7]:
# arguments: start, stop, step
x = np.arange(0, 10.6, 0.3)
x

array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8,  2.1,  2.4,  2.7,  3. ,
        3.3,  3.6,  3.9,  4.2,  4.5,  4.8,  5.1,  5.4,  5.7,  6. ,  6.3,
        6.6,  6.9,  7.2,  7.5,  7.8,  8.1,  8.4,  8.7,  9. ,  9.3,  9.6,
        9.9, 10.2, 10.5])

In [8]:
np.linspace(0, 10, 25)

array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])

In [9]:
np.logspace(0, 10, 10, base=np.e)

array([1.00000000e+00, 3.03773178e+00, 9.22781435e+00, 2.80316249e+01,
       8.51525577e+01, 2.58670631e+02, 7.85771994e+02, 2.38696456e+03,
       7.25095809e+03, 2.20264658e+04])

In [10]:
x, y = np.mgrid[0:5, 0:5]
x

array([[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4]])

In [11]:
y

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

In [12]:
m = np.diag([1.1,2.2,3.3])
m

array([[1.1, 0. , 0. ],
       [0. , 2.2, 0. ],
       [0. , 0. , 3.3]])

In [13]:
m.nbytes

72

In [14]:
m.ndim

2

In [15]:
v[0], m[1,1]

(np.int64(1), np.float64(2.2))

In [16]:
m[1]

array([0. , 2.2, 0. ])

In [17]:
# assign new value
m[0,0] = 7
m

array([[7. , 0. , 0. ],
       [0. , 2.2, 0. ],
       [0. , 0. , 3.3]])

In [18]:
m[0,:] = 0
m

array([[0. , 0. , 0. ],
       [0. , 2.2, 0. ],
       [0. , 0. , 3.3]])

In [19]:
# slicing works just like with lists
a = np.array([1,2,3,4,5])
a[1:3]

array([2, 3])

In [20]:
a = np.array([[n+m*10 for n in range(5)] for m in range(7)])
a

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44],
       [50, 51, 52, 53, 54],
       [60, 61, 62, 63, 64]])

In [21]:
row_indices = [1, 2, 3]
a[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [22]:
# index masking
b = np.array([n for n in range(5)])
row_mask = np.array([True, False, True, False, False])
b[row_mask]

array([0, 2])

## SciPy

Collection of mathematical algorithms and convenience functions built on top of `NumPy`.

Provides high-level commands and classes for manipulating and visualizing data, for example:

- `scipy.linalg`: contains the functions in `numpy.linalg` plus more advanced ones, and is always compiled with BLAS/LAPACK support
- `scipy.stats`: statistical tools (distributions, sampling  etc.)
- `scipy.spatial`: spatial data structures and algorithms, e.g., calculating distances between points

Check out the [User Guide](https://docs.scipy.org/doc/scipy/tutorial/)!

## Randomization (preliminary)

- Random Variables
- Indicator Random Variables
- Expectation
- Variance & Standard Deviation
- Uniform, Bernoulli and Geometric distributions
- Normal Distribution

## Events In a Sample Space

- *event/sample space*: a set 
- *elementary events*: elements of the event space 
- *events*: *subsets* of elementary events. 

### Sequences of 3 coin flips

Event Space: $\{TTT, ~TTH, ~THT, ~THH, ~HTT, ~HTH, ~HHT, ~HHH\}$

An example event: sequences that begin with $H$, i.e.
$\{~HTT, ~HTH, ~HHT, ~HHH\}$
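The event space and the example event can be enumerated directly. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from itertools import product

# Elementary events: all sequences of 3 coin flips
space = ["".join(flips) for flips in product("TH", repeat=3)]

# An event is a subset of the space, here the sequences that begin with H
begins_with_h = [s for s in space if s.startswith("H")]

print(space)          # ['TTT', 'TTH', 'THT', 'THH', 'HTT', 'HTH', 'HHT', 'HHH']
print(begins_with_h)  # ['HTT', 'HTH', 'HHT', 'HHH']
```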

## Probability

With every elementary event $E$, we associate a non-negative real number which is the **probability** $\text{Pr}(E)$ of the event.  Probabilities are:

- between 0 and 1,
- additive for **disjoint** events, and
- sum to 1 over the entire event space

## Random Variables

A random variable $X$ is a **function** that maps elementary events to real numbers.

$X=x$: the event that is the **union** of all elementary events that are mapped to the same real value $x$. 

### Number of heads in a sequence of 3 fair coin flips

$X$ can have values 0, 1, 2 and 3

$X=2$ is the event consisting of the set of all 3-flip sequences with **exactly 2 heads**, namely $\{HHT, ~HTH, ~THH\}$

$\text{Pr} (X=2) ~=~\frac{3}{8}$ since each flip sequence is equally likely. 
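The whole distribution of $X$ can be tabulated the same way. The sketch below assumes a fair coin, so every 3-flip sequence has probability $1/8$, and groups the sequences by their number of heads.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

space = ["".join(flips) for flips in product("HT", repeat=3)]

# X maps each elementary event to its number of heads
counts = Counter(s.count("H") for s in space)

# Pr(X = x) = (number of sequences with x heads) / (size of the space)
pmf = {x: Fraction(c, len(space)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)   # 0 1/8, 1 3/8, 2 3/8, 3 1/8
```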

## Conditional Probability

$$\text{Pr}(A ~|~ B)$$

is the probability of event $A$ **given** event $B$. 

For example, the probability of at least two tails among 4 tosses given that there are at least two heads is $\frac{6}{11}$. **Why?**

### Bayes' Theorem

$$\text{Pr}(A ~|~ B) = \frac{\text{Pr}(A ~\cap~ B)}{\text{Pr}(B)} = \frac{\text{Pr}(B ~|~ A) ~\text{Pr}(A)}{\text{Pr}(B)}$$
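One way to check the 4-toss example is brute-force enumeration. The sketch below assumes fair, independent tosses and computes $\text{Pr}(A ~|~ B)$ twice: directly from the definition, and by going through $\text{Pr}(B ~|~ A)$. The helper `pr` is a hypothetical convenience function.

```python
from fractions import Fraction
from itertools import product

space = {"".join(flips) for flips in product("HT", repeat=4)}

def pr(event):
    """Probability of an event (a set of sequences) under fair tosses."""
    return Fraction(len(event), len(space))

A = {s for s in space if s.count("T") >= 2}   # at least two tails
B = {s for s in space if s.count("H") >= 2}   # at least two heads

direct = pr(A & B) / pr(B)                    # definition of Pr(A | B)
pr_b_given_a = pr(A & B) / pr(A)              # Pr(B | A)
via_bayes = pr_b_given_a * pr(A) / pr(B)      # Bayes' theorem

print(direct, via_bayes)   # 6/11 6/11
```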

## Independence

Two events $A$ and $B$ are independent if 

$$\text{Pr}(A ~|~B) = \text{Pr}(A)$$

Equivalently: if $\text{Pr}(A ~\cap~ B) = \text{Pr}(A)\text{Pr}(B)$
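The product rule is easy to test on the 3-flip space. In this sketch (fair coin assumed, event names chosen for illustration), the first two flips are independent of each other, while "first flip is heads" and "at least two heads" are not.

```python
from fractions import Fraction
from itertools import product

space = {"".join(flips) for flips in product("HT", repeat=3)}

def pr(event):
    """Probability of an event under equally likely sequences."""
    return Fraction(len(event), len(space))

A = {s for s in space if s[0] == "H"}          # first flip is heads
B = {s for s in space if s[1] == "H"}          # second flip is heads
C = {s for s in space if s.count("H") >= 2}    # at least two heads

print(pr(A & B) == pr(A) * pr(B))   # True:  1/4 == 1/2 * 1/2
print(pr(A & C) == pr(A) * pr(C))   # False: 3/8 != 1/2 * 1/2
```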

## Expected/Mean Value and Variance

The **expectation** (aka **expected value**, **mean value**) of the random variable $X$ is given by:

$$E[X] = \sum_{x} x \cdot \text{Pr}(X=x)$$

The sum becomes a **definite integral** in a natural way for **continuous** valued random variables. 

The **variance** of the r.v. $X$ is given by:

$$\text{Var}(X) = E[(X-E[X])^2]$$

An easy derivation shows that the variance equals $E[X^2] - E[X]^2$ (exercise for you!!)
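Before doing the derivation, the identity can be sanity-checked numerically. This sketch takes $X$ to be the number of heads in 3 fair flips (an assumption carried over from the earlier example) and evaluates both expressions exactly.

```python
from fractions import Fraction
from itertools import product

space = ["".join(flips) for flips in product("HT", repeat=3)]
values = [s.count("H") for s in space]     # X evaluated on each sequence
pr = Fraction(1, len(space))               # each sequence is equally likely

mean = sum(x * pr for x in values)
var_def = sum((x - mean) ** 2 * pr for x in values)     # E[(X - E[X])^2]
var_alt = sum(x**2 * pr for x in values) - mean**2      # E[X^2] - E[X]^2

print(mean, var_def, var_alt)   # 3/2 3/4 3/4
```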



## Covariance

The covariance of $X$ and $Y$ is given by:

$$\text{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])]$$

### Exercise

Express $\text{Var}(X + Y)$ in terms of $\text{Var}(X)$ and $\text{Var}(Y)$ and a residual term. What is that term?

## Expectation: Linearity, Deviation

- **Linearity of Expectation:** The expected value of a sum of random variables is the sum of their expected values.

- **Markov's Inequality:** If $X$ is a non-negative valued random variable, then for any $t > 0$,
$$\text{Pr}\{X \geq t\} \leq \frac{E[X]}{t}$$

- **Chebyshev's Inequality:** For any random variable $X$ with finite variance and any $k > 0$,  
$$\text{Pr}\{\mid X - E[X] \mid \geq k\} \leq \frac{\text{Var}(X)}{k^2}$$
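To get a feel for how loose these bounds can be, here is a small exact check. It assumes $X$ is the number of heads in 10 fair coin flips (a binomial distribution chosen only for illustration) and compares the true tail probabilities with the two bounds.

```python
from fractions import Fraction
from math import comb

n = 10
# Exact pmf of the number of heads in n fair flips
pmf = {x: Fraction(comb(n, x), 2**n) for x in range(n + 1)}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

t = 8
markov_lhs = sum(p for x, p in pmf.items() if x >= t)   # Pr{X >= t}
markov_rhs = mean / t                                   # E[X] / t

k = 3
cheb_lhs = sum(p for x, p in pmf.items() if abs(x - mean) >= k)
cheb_rhs = var / k**2                                   # Var(X) / k^2

print(markov_lhs, "<=", markov_rhs)   # 7/128 <= 5/8
print(cheb_lhs, "<=", cheb_rhs)       # 7/64 <= 5/18
```

Since $\text{Pr}\{X > t\} \leq \text{Pr}\{X \geq t\}$, the same Markov bound also covers the strict version of the event.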


## Preliminaries

- Useful summations: arithmetic, geometric, harmonic, binomial
    $$\sum_{i=1}^{n} ~i ~=~ \frac{n(n+1)}{2}$$
    $$\sum_{i=0}^{n-1} ~c^{i} ~=~ \frac{c^{n}- 1}{c-1} \quad (c \neq 1)$$
- Big-oh notation, e.g. 
    $$\sum_{i=1}^{n} ~i ~=~ \Theta(n^2)$$
- Useful inequalities and approximations, e.g. for small $x$
    $$e^{x} ~=~ 1 + \frac{x}{1!} + \frac{x^2}{2!} + \ldots ~\approx~ 1+x$$ 
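These formulas are easy to spot-check numerically. The values of `n`, `c`, `terms` and `x` below are arbitrary choices for illustration.

```python
import math

n = 100
arithmetic = sum(range(1, n + 1))
print(arithmetic, n * (n + 1) // 2)                  # 5050 5050

c, terms = 3, 10
geometric = sum(c**i for i in range(terms))
print(geometric, (c**terms - 1) // (c - 1))          # 29524 29524

x = 0.01
# For small x the error of the approximation 1 + x is about x^2 / 2
print(math.exp(x), 1 + x)
```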


## Distributions

- Uniform
- Bernoulli
- Binomial
- Geometric
- Normal

Use **Wikipedia** to check out these and other simple distributions. 


## Indicator Random Variables

These are variables that **indicate** a sharp **0 or 1 value** based on whether an event does not happen (value 0)  or happens (value 1). The most common example is, of course, the event that a coin toss successfully turns up heads: a **Bernoulli trial**. The following notation ($\mathbb{I}$) is commonly used:

$$X = \mathbb{I}[\text{toss is H}], ~~\text{with probability } p$$

Then the mean and variance of $X$ are particularly easy to compute:

$$\begin{align}
E[X] & = p\cdot 1 + (1-p)\cdot 0 \\
& = p
\end{align}
$$
$$\begin{align}
\text{Var}(X) & = E[X^2] - E[X]^2 \\
& = p - p^2 \\
& = p(1-p)
\end{align}
$$
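A quick simulation agrees with these formulas. This sketch assumes $p = 0.3$ and a fixed seed (both arbitrary choices), and uses a separately named generator `demo_rng` so that it does not interfere with other cells.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)   # fixed seed for repeatability
p = 0.3
n_trials = 100_000

# Each entry is the indicator of a "success" that occurs with probability p
x = (demo_rng.random(n_trials) < p).astype(int)

print(x.mean(), p)             # sample mean is close to p
print(x.var(), p * (1 - p))    # sample variance is close to p(1-p)
```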

## Hashing

- Hash functions: keys are hashed to buckets 
- Open-addressing (one key per bucket)
- Hashing with chaining (colliding keys chained in bucket)
- Assumptions: uniform hashing
- Used very widely to construct efficient data structures for storing big data (indexes).

#### Collision

> Occurs when keys $x \not= y$ end up hashing to the same slot! $$h(x) = h(y)$$

- Collisions are inevitable, especially when we have no prior knowledge of $V$. 

- Distribution of the number of keys that collide in different slots can vary wildly!

> **Ideal distribution:** collisions distributed *uniformly*.

In [23]:
# Examples of hash functions

def h(key, m):
    """Returns slot index for key
    
    Args:
        key (int): number to be hashed
        m (int): table size    
    """
    return key % m

def g(d, s):
    """Returns 32-bit number as slot index
    
    Uses the FNV algorithm from http://isthe.com/chongo/tech/comp/fnv/ 
    
    Args:
        d (int): seed (initial hash value); 0 selects the default 0x01000193
        s (str): key to be hashed
    
    """
    if d == 0: 
        d = 0x01000193
    for c in s:
        d = ( (d * 0x01000193) ^ ord(c) ) & 0xffffffff
    return d

In [24]:
g(0, 'Jaden')

1542760701
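To see how wildly the number of colliding keys can vary, the sketch below counts bucket loads for a simple modular hash. The helper names `mod_hash` and `bucket_loads` and the key set (multiples of 10) are assumptions made for illustration.

```python
from collections import Counter

def mod_hash(key, m):
    """Hypothetical stand-in for a simple modular hash: key mod table size."""
    return key % m

def bucket_loads(keys, m):
    """Count how many keys land in each of the m slots."""
    counts = Counter(mod_hash(k, m) for k in keys)
    return [counts.get(slot, 0) for slot in range(m)]

keys = range(0, 1000, 10)        # 100 keys, all multiples of 10

print(bucket_loads(keys, 10))    # all 100 keys collide in slot 0
print(bucket_loads(keys, 7))     # [15, 14, 14, 15, 14, 14, 14]
```

With table size 10 the keys share a factor with the modulus and pile up in one slot; with the prime table size 7 the same keys spread almost uniformly.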

### Universal hash family

We start by defining a universal family of **hash functions**. This is a family with the property that:

> for any pair of distinct keys in the universe of keys (i.e. $U$), a hash function drawn uniformly at random from the family will cause the keys to collide with probability at most $1/m$, where $m$ is the table size.

Let $p$ be a large prime, and let $a, b$ be random integers chosen uniformly such that $1 \leq a \leq p-1$ and $0 \leq b \leq p-1$.
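The text does not spell out the hash function itself. A common choice that fits these parameters, and that appears to match `make_hash` in the `UHF` class, is the Carter-Wegman construction, stated here as an assumption:

$$h_{a,b}(k) ~=~ \big((a \cdot k + b) \bmod p\big) \bmod m$$

where $k$ is the key, $m$ is the table size, and $p$ is assumed to be larger than every key in $U$. Each choice of the pair $(a, b)$ gives one member of the family.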

In [25]:
import random
import math
import numpy as np
import pandas as pd

rng = np.random.default_rng()
class UHF:
    """A factory for producing a universal family of hash functions"""

    @staticmethod
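    def is_prime(k):
        """Trial division by odd numbers up to the integer square root of k"""
        if k < 2 or k%2==0:
            return k == 2
        # include isqrt(k) itself, otherwise odd squares such as 9 pass as prime
        for i in range(3, math.isqrt(k) + 1, 2):
            if k%i == 0:
                return False
        return True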
    
    def __init__(self, n):
        """Universe size is n"""
        self.n = n
        if n%2==0:
            m = n+1
        else:
            m = n+2
        while not(UHF.is_prime(m)):
            m = m+2
        self.p = m
        
    def make_hash(self, m):
        """Return a random hash function
        
        m: table size
        """
        a = random.randint(1,self.p-1)
        b = random.randint(0,self.p-1)
        return lambda k: ((a*k+b)%self.p)%m

In [26]:
factory = UHF(1000000)

In [27]:
factory.p

1000003

### Test the Universal Hash Family (Exercise)

Set up and evaluate an experiment to verify that the family defined above is indeed a universal family. Use a reasonable universe of keys, say integers in `range(1000000)`, and a hash table of size $5000$. 

- Fix two distinct keys at random
- Generate a lot of hash functions and count how many of them make the two keys collide 

## Similarity and Minhashing

### Data Mining

Given lots of data, discover patterns and models that are:

- **Valid:**  hold on new data with some certainty
- **Useful:**  should be possible to act on the item 
- **Unexpected:**  non-obvious to the system
- **Understandable:** humans should be able to interpret the pattern

#### Data as Feature Matrices

- Each data item is generically referred to as a **document**

- In the simplest case, documents are *characterized* by the presence or absence of *binary* features (indicated by 0 and 1), i.e., the data is the **characteristic matrix** of the features.


In [28]:
import pandas as pd

In [29]:
characters = pd.DataFrame({'Hermione': [1,0,0,1,1,0,1],
                           'Harry': [0,0,1,1,1,0,1],
                           'Ron': [1,1,0,0,0,1,1],
                           'Severus': [1,0,0,0,0,1,1]})
                           
                           

In [30]:
characters

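|   | Hermione | Harry | Ron | Severus |
|---|----------|-------|-----|---------|
| 0 | 1 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 1 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 |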


### Jaccard Similarity

Jaccard *similarity* between two documents $D_i$ and $D_j$ equals 
$$ \mbox{SIM}(i,j) = \frac{|D_i \cap D_j|}{|D_i \cup D_j|}.$$
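For 0/1 feature vectors, the intersection and union sizes are just counts of positions. Here is a minimal sketch with a hypothetical helper `jaccard` and two made-up documents `d1` and `d2` (not the course data).

```python
import numpy as np

def jaccard(u, v):
    """Jaccard similarity of two 0/1 feature vectors of equal length."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    union = np.logical_or(u, v).sum()
    if union == 0:
        return 0.0   # convention assumed here when both documents are empty
    return np.logical_and(u, v).sum() / union

d1 = [1, 1, 0, 1, 0]   # hypothetical document with features 0, 1, 3
d2 = [0, 1, 1, 1, 0]   # hypothetical document with features 1, 2, 3

print(jaccard(d1, d2))   # 2 shared features out of 4 in the union: 0.5
```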



In [31]:
f_matrix = characters.to_numpy()

**Question:** What is the Jaccard similarity between `Harry` and `Hermione`? Between `Severus` and `Ron`?

#### Compressing a document's representation: Minhash Signatures

Choose a *random* permutation $p$ of the features. A **minhash signature** 
$${\mbox{sign}}_p(i)$$
for document $D_i$ using permutation $p$ is obtained as follows:

- recall that the document is a feature vector in the original order of features
- **Reorder** the feature vector according to the permutation $p$
- Among all features present in the document, determine the one with the **smallest index in the new ordering**! This index is the signature for the document.

#### Example

In [32]:
f_matrix = characters.to_numpy()
f_matrix

array([[1, 0, 1, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 1, 1]])

For our example above, consider document `Hermione` with feature vector 
$$[1,0,0,1,1,0,1]$$ (the first column above).

Consider the *permutation* of the **feature indices** given by 
$$[3,1,0,6,2,5,4].$$

This says that feature 3 will be numbered 0, feature 1 still numbered 1, feature 0 numbered 2 and so on. Hence, the permutation will **reorder** document `Hermione` as
$$[\underline{1},0,1,1,0,0,1]$$
where, for example, the underlined 1 at the new index **0** corresponds to the original feature number 3. 

If we apply the permutation across all characters (columns), we get:

In [33]:
perm = [3,1,0,6,2,5,4]
perm_mat = np.take(f_matrix, perm, axis=0)
perm_mat

array([[1, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 1, 1],
       [1, 1, 1, 1],
       [0, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 0, 0]])

In [34]:
f_matrix

array([[1, 0, 1, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 1, 1]])

To determine the **signature** of document `Hermione` according to permutation `perm` above, we just walk down the *permuted* column until we find the **first row index** corresponding to a feature that `Hermione` *contains*, viz. the 1 entry at index **0**. 

Doing this for all columns, the signatures for the documents (in order) according to `perm` are: 0, 0, 1, 2.

In [35]:
np.argmin(1 - perm_mat, axis=0)

array([0, 0, 1, 2])

## Questions

- What are the signatures for the other documents according to the permutation $[6,4,2,0,5,3,1]$?
- What signatures are possible, in theory, for `Harry`?
- What **common signatures** are possible for `Harry` and `Severus`?


In [36]:
f_matrix

array([[1, 0, 1, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 1, 1]])

### Using numpy to compute document signatures

We can do this in a variety of ways. For example:

1. Start with a random permutation $p$ of the row indices. Apply the permutation $p$ to the matrix, then determine the signature by using `argmin` over the matrix with entries *flipped* (to find the first 1).

In [37]:
p_mat = rng.permutation(f_matrix, axis=0)
np.argmin(1 - p_mat, axis=0)

array([1, 1, 0, 1])

2. We need not actually permute the rows of the matrix at all - permuting rows can be very expensive with large data. Instead, we imagine that the rows are numbered from 1 through $m$, the number of features. A permutation can be thought of as supplying **weights** for the corresponding features. We multiply each feature by its weight, then find the smallest non-zero weighted entry in a column: that is the signature! 

In [38]:
## Mask all 0 values
n_rows = f_matrix.shape[0]
masked = np.where(f_matrix == 0, n_rows+1, f_matrix)
masked

array([[1, 8, 1, 1],
       [8, 8, 1, 8],
       [8, 1, 8, 8],
       [1, 1, 8, 8],
       [1, 1, 8, 8],
       [8, 8, 1, 1],
       [1, 1, 1, 1]])

In [39]:
f_matrix

array([[1, 0, 1, 1],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 1, 1]])

In [40]:
perm_arr = np.array(perm) + 1  # make elements non-zero
permuted_tr = masked.T * perm_arr  # transpose to do elementwise multiplication

In [41]:
permuted_tr

array([[ 4, 16,  8,  7,  3, 48,  5],
       [32, 16,  1,  7,  3, 48,  5],
       [ 4,  2,  8, 56, 24,  6,  5],
       [ 4, 16,  8, 56, 24,  6,  5]])

In [42]:
signs = np.min(permuted_tr.T, axis=0)

In [43]:
signs

array([3, 1, 2, 4])

## Discussion (5 minutes)

We used the same permutation `perm` = [3,1,0,6,2,5,4] but seem to get different signatures. 
Why is that? Discuss in your group why this is happening. Understand the code in the cells below to frame the discussion.


In [44]:
perm_inv = np.argsort(perm)
perm_inv

array([2, 1, 4, 0, 6, 5, 3])

In [45]:
np.argmin(1 - np.take(f_matrix, perm, axis=0), axis=0)

array([0, 0, 1, 2])

### Relationship between Jaccard similarity and signatures

Given a *random* permutation $p$, the Jaccard similarity, $\mbox{SIM}(i,j)$ is related to signatures by the following property:

> $$\mbox{Pr}\{{\mbox{sign}}_p(i) = {\mbox{sign}}_p(j)\} = \mbox{SIM}(i,j)$$

Thus, the likelihood of two documents being deemed similar (via their signatures) will be in accordance with their (mathematical) similarity!

**Proof:** In class!
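For a small number of features, the property can be checked exactly by going through every permutation. The sketch below uses two made-up documents over 5 features (an assumption, not the course data) and a hypothetical helper `signature`; the fraction of the $5! = 120$ permutations with matching signatures should equal the Jaccard similarity.

```python
from fractions import Fraction
from itertools import permutations
import numpy as np

d1 = np.array([1, 1, 0, 1, 0])   # hypothetical document
d2 = np.array([0, 1, 1, 1, 0])   # hypothetical document

def signature(doc, perm):
    """Smallest new index that holds a 1 after reordering doc by perm."""
    return int(np.argmax(doc[list(perm)]))   # argmax returns the first 1

matches = 0
total = 0
for perm in permutations(range(len(d1))):
    total += 1
    matches += signature(d1, perm) == signature(d2, perm)

jaccard = Fraction(int(np.sum(d1 & d2)), int(np.sum(d1 | d2)))
print(Fraction(matches, total), jaccard)   # 1/2 1/2
```

Exhaustive enumeration is only feasible for tiny examples; it is meant as a check on the statement, not as a substitute for the proof.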

## Minhashing

Based on the proof, we can see that if we were to calculate a large number of **different signatures** (each corresponding to a different permutation), we can approximate the true Jaccard similarity of two documents as the fraction of permutations with matching signatures. 

Random permutations are very expensive to compute: it takes time linear in $n$ to obtain such a permutation. When $n$ is very large, this is infeasible.

So, we **approximate the signature** by using a **random hash function** in
place of a random permutation!

> Use the **range** of a hash function applied to the feature indices as a **proxy** for permuting the indices of the features. 

## Minhash Signatures

Minhashing using just one small signature will unfortunately produce both **false positives** and **false negatives** for similarity. We try to mitigate this as follows:

- generate **many** small signatures using different hash functions

*Minhash similarity* between two documents is the expected proportion of hash
functions for which their small signatures agree!

#### Jaccard Distance

The Jaccard distance between two documents $i$ and $j$ is given by 
$$d(i,j) = 1 - \mbox{SIM}(i,j).$$

It is a distance *metric*:

- it equals 0 iff $D_i = D_j$
- it is symmetric: $d(i,j)=d(j,i)$
- it satisfies the *triangle inequality*: $d(i,j) ~\leq~ d(i,k) ~+~ d(k,j)$
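The metric properties can be spot-checked on a few small sets. This sketch uses a hypothetical helper `jaccard_distance` and three made-up documents given as sets of feature indices; it illustrates the properties on one example and does not prove them.

```python
from itertools import permutations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of feature indices."""
    union = a | b
    if not union:
        return 0.0   # convention assumed here for two empty documents
    return 1 - len(a & b) / len(union)

docs = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}]   # hypothetical documents

for a, b, c in permutations(docs, 3):
    # symmetry
    assert jaccard_distance(a, b) == jaccard_distance(b, a)
    # triangle inequality (small tolerance for floating point)
    assert jaccard_distance(a, c) <= jaccard_distance(a, b) + jaccard_distance(b, c) + 1e-12

print(jaccard_distance(docs[0], docs[1]))   # 0.5
```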

#### Exercise

- Compute signature matrices for a set of documents using both random permutations and hash functions

- Use the matrices to see how accurately they approximate the true Jaccard similarity.


In [46]:
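def sign(mat, h_fn, p):
    """
    Return array of signatures for all documents

    The shape of the matrix is (m, n) where m = number of documents
    and n = number of features. The hash function has domain equal to
    the column indices 0, 1, ..., n-1
    
    mat (np.array): 0/1 matrix
    h_fn (int -> int): universal hash function whose range is 0, ..., p-1
    p (int): upper bound on values returned by h_fn 
    """
    n_features = mat.shape[1]
    # hash every feature index once; the hash value acts as its new position
    hashes = np.vectorize(h_fn)(np.arange(n_features))
    # absent features get the sentinel p, which exceeds every hash value,
    # so they can never win the minimum (multiplying a mask by the hash
    # would let a hash value of 0 or a small product win incorrectly)
    s = np.where(mat == 0, p, hashes)
    return np.min(s, axis=1)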


In [47]:
tbl_size = 53
uhf = UHF(tbl_size)
h = uhf.make_hash(tbl_size)

In [48]:
signatures = sign(f_matrix.T, h, tbl_size)
signatures

array([3, 3, 2, 2])