# Algorithms for Data Science -- Laboratory 5
Author: Pablo Mollá Chárlez

# Computing Moments of a Stream

## 1. Preliminaries 

The objective of this lab is to $\textcolor{red}{\text{implement the Alon-Matias-Szegedy approach}}$ to estimate the second moment of a stream, also called the $\textcolor{orange}{\text{surprise number}}$: the sum of the squared occurrence counts of its items. The stream items take $N$ distinct values, from $0$ to $N-1$.
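
To make the target quantity concrete, here is a minimal sketch that computes the exact second moment of a small stream held in memory. The function name `second_moment` and the toy list `example` are illustrative assumptions, not part of the lab code.

```python
from collections import Counter

def second_moment(stream):
    """Exact second moment: the sum of squared occurrence counts."""
    counts = Counter(stream)  # value -> number of occurrences
    return sum(m * m for m in counts.values())

# Toy stream (hypothetical): 3 appears three times, 1 and 2 once each
example = [3, 1, 3, 2, 3]
print(second_moment(example))  # 3*3 + 1*1 + 1*1 = 11
```

This exact computation needs one counter per distinct value, which is what the AMS estimator avoids.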

In [1]:
import random

# Parameters
# N distinct values between 0 and N-1
N = 256
stream_size = 10000

## 2. Alon-Matias-Szegedy Algorithm for Second Moments

We implement here the $\textcolor{cyan}{\text{Alon-Matias-Szegedy Algorithm}}$ when the stream size is known:

1. We $\textcolor{orange}{\text{choose a number t between 0 and stream\_size-1}}$, uniformly at random, from which the counts are kept.

2. When the stream is at timestamp $t$, we $\textcolor{orange}{\text{initialize v = S(t) and c=1}}$.

3. $\textcolor{orange}{\text{Increment c by 1}}$, $\textcolor{red}{\text{if we encounter v}}$ again at any later timestamp.

At the end of the stream, we output the estimator:
$$
\fbox{stream\_size $\times$ (2c-1)}
$$

This can be easily extended to an arbitrary number of counts by generating $k$ different timestamps, keeping arrays of $v$ and $c$, and averaging the $k$ individual estimates.
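
A short derivation shows why the estimator above is unbiased. The symbols are assumptions introduced here for brevity: $n$ stands for `stream_size`, $m_a$ is the number of occurrences of value $a$, and $c_t$ is the final count obtained when timestamp $t$ is chosen. If $t$ is the $j$-th occurrence of $a$ counted from the end of the stream, then $c_t = j$, and each timestamp is chosen with probability $\frac{1}{n}$:

$$
\mathbb{E}\big[n\,(2c-1)\big] = \frac{1}{n}\sum_{t=0}^{n-1} n\,(2c_t-1) = \sum_{a}\sum_{j=1}^{m_a}(2j-1) = \sum_{a} m_a^2
$$

The last step uses the fact that the sum of the first $m_a$ odd numbers is $m_a^2$.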

In [2]:
# Initialize values and counts
v = []
c = []

# Keeping the true counts 
counts = {}

# Choosing k timestamps
k = 10
t = []
for _ in range(k):
  t.append(random.randrange(stream_size))
  v.append(-1)
  c.append(0)

for i in range(stream_size):
  # Take a random value between 0 and N-1
  s = random.randrange(N)
  # AMS approach
  for j in range(k):
    # Chosen timestamp
    if i==t[j]:
      v[j] = s
      c[j] = 1
    # After timestamp  
    elif i>t[j] and s==v[j]:
      c[j] += 1
  # True counts (only for evaluation!)
  if s not in counts:
    counts[s] = 0
  counts[s] = counts[s]+1

# True 2nd moment
true = 0
for x in counts.keys():
  true += counts[x]*counts[x]

# 2nd moment estimator
est = 0
for x in range(k):
  est += 2*c[x]-1
est = int((stream_size/k)*est)

print('Estimation of 2nd moment: %d'%est)
print('True second moment: %d'%true)

Estimation of 2nd moment: 356000
True second moment: 400890


## 3. **TASK** AMS for Infinite Streams

Implement the $\textcolor{red}{\text{case when the estimator does not know the size of the stream}}$.

In this case, instead of generating $k$ timestamps, we $\textcolor{cyan}{\text{proceed to use Reservoir Sampling}}$ as explained in the lecture:

1. Initialize $\textcolor{orange}{\text{v and c}}$ with the corresponding values in the first $k$ items in the stream $S$,

2. For timestamp $t>k$, we $\textcolor{orange}{\text{decide whether to replace a v with probability } \frac{k}{t}}$,

3. If $\textcolor{green}{\text{true}}$, we $\textcolor{green}{\text{replace a value}}$ ($\textcolor{green}{\text{and its corresponding count}}$) $\textcolor{green}{\text{at random}}$ in the arrays $v$ and $c$, re-initializing that slot to $v = S(t)$ and $c = 1$.
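
The steps above can be sketched as a small online class that never needs the stream length in advance: it counts the items seen so far and uses that count both for the replacement probability and for the final scaling. The class name `StreamingAMS` and its methods are hypothetical, chosen for illustration only.

```python
import random

class StreamingAMS:
    """Hypothetical sketch: AMS second-moment estimator with reservoir sampling."""

    def __init__(self, k, rng=None):
        self.k = k                       # maximum number of tracked slots
        self.n = 0                       # number of items seen so far
        self.v = []                      # tracked values
        self.c = []                      # occurrence counts since each slot was set
        self.rng = rng or random.Random()

    def update(self, item):
        self.n += 1
        # Every slot already tracking this value sees one more occurrence
        for j, tracked in enumerate(self.v):
            if tracked == item:
                self.c[j] += 1
        if len(self.v) < self.k:
            # The first k items fill the reservoir
            self.v.append(item)
            self.c.append(1)
        elif self.rng.random() < self.k / self.n:
            # Keep the n-th item with probability k/n, evicting a random slot
            j = self.rng.randrange(self.k)
            self.v[j] = item
            self.c[j] = 1

    def estimate(self):
        if not self.c:
            return 0.0
        # Average of the per-slot estimates n * (2c - 1)
        return self.n * sum(2 * c - 1 for c in self.c) / len(self.c)

sketch = StreamingAMS(k=8)
for item in "abacabad":
    sketch.update(item)
print(sketch.estimate())
```

When $k$ is at least the number of items seen, every position is tracked and the estimate equals the exact second moment, which gives a convenient sanity check.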

In [3]:
import random

# Parameters
# N distinct values between 0 and N-1
N = 256

# Stream size
stream_size = 10000

# Different values of k to evaluate
k_values = [10, 50, 100]

# Keeping the true counts (for evaluation purposes)
counts = {}
stream = []

# Simulating the stream
for i in range(stream_size):
    # Take a random value between 0 and N-1
    s = random.randrange(N)
    
    # Storing stream values (only for evaluation purposes)
    stream.append(s)
    
    # True counts (only for evaluation purposes)
    if s not in counts:
        counts[s] = 0
    counts[s] += 1

# True second moment (for evaluation)
true_second_moment = sum(count * count for count in counts.values())
print(f'True second moment: {true_second_moment}')

# Evaluating AMS estimator for different k values
for k in k_values:
    # Initialize v and c for AMS algorithm
    # To store the sampled values
    v = [-1] * k
    # To store the counts for the sampled values
    c = [0] * k

    # AMS algorithm with reservoir sampling
    for i in range(stream_size):
        # Use the previously generated stream values
        s = stream[i]

        # Every slot already tracking s sees one more occurrence. This must
        # happen for every item, including the first k and those that
        # trigger a replacement, otherwise tracked counts are undercounted.
        for j in range(k):
            if s == v[j]:
                c[j] += 1

        if i < k:
            # The first k items fill the reservoir
            v[i] = s
            c[i] = 1
        elif random.random() < k / (i + 1):
            # Keep item i + 1 with probability k / (i + 1): evict a slot
            # chosen uniformly at random and restart its count
            j = random.randrange(k)
            v[j] = s
            c[j] = 1

    # Estimating the second moment using AMS estimator
    est_second_moment = 0
    for x in range(k):
        est_second_moment += (2 * c[x] - 1)

    # Scale the estimate by stream_size / k
    est_second_moment = int((stream_size / k) * est_second_moment)

    # Output the estimated second moment for this k
    print(f'Estimation of 2nd Moment with k = {k}: {est_second_moment}')


True second moment: 400114
Estimation of 2nd Moment with k = 10: 514000
Estimation of 2nd Moment with k = 50: 395600
Estimation of 2nd Moment with k = 100: 393200


## 4. Discussion of Results:

### 4.1 $\underline{\text{True Second Moment}}$:
- The true second moment is 400,114. This is the exact sum of the squares of the frequencies of the distinct items in the stream, and it serves as the reference point for evaluating the accuracy of the AMS estimates.

### 4.2 $\underline{\text{Estimation Results}}$:
- Estimation of 2nd moment with $k = 10$: 514,000
- Estimation of 2nd moment with $k = 50$: 395,600
- Estimation of 2nd moment with $k = 100$: 393,200

### 4.3 $\underline{\text{Observations}}$:

#### 4.3.1 $\underline{\text{Impact of Sample Size k}}$:

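   - As expected, the accuracy of the AMS estimator tends to improve as $k$ (the number of samples) increases, although a single run per $k$ is noisy and the ordering between nearby values of $k$ is not guaranteed.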

   - With a $\textcolor{red}{\text{small sample size k = 10}}$, the estimate is **514,000**, which $\textcolor{red}{\text{deviates significantly from the true second moment}}$ of **400,114** (about $\textcolor{red}{\text{28\% higher}}$ than the true value). This reflects the $\textcolor{red}{\text{high variability}}$ in estimates when fewer samples are used.

   - For $\textcolor{orange}{\text{k = 50}}$, the estimate of **395,600** is $\textcolor{orange}{\text{much closer to the true value}}$ (a $\textcolor{orange}{\text{deviation of only 1.1\%}}$), suggesting that the algorithm is working well as the sample size grows.

   - With $\textcolor{green}{\text{k = 100}}$, the estimate of **393,200** is also close to the true second moment (a $\textcolor{green}{\text{deviation of 1.7\%}}$). In this run it is slightly further from the true value than the $\textcolor{orange}{\text{k = 50}}$ estimate (1.7% versus 1.1%), a gap small enough to be explained by the randomness of a single run rather than by a real loss of precision.
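
The percentages quoted above can be reproduced from the printed results. The snippet below is a small check using the values reported in this notebook; the helper name `relative_deviation` is an assumption made for illustration.

```python
# Values copied from the cell output above
true_f2 = 400114
estimates = {10: 514000, 50: 395600, 100: 393200}

def relative_deviation(estimate, truth):
    """Absolute deviation as a fraction of the true value."""
    return abs(estimate - truth) / truth

deviations = {k: relative_deviation(e, true_f2) for k, e in estimates.items()}
for k, d in deviations.items():
    print(f"k = {k:3d}: {100 * d:.1f}% off")
```

This gives roughly 28.5%, 1.1% and 1.7% for $k = 10$, $50$ and $100$ respectively.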

#### 4.3.2 $\underline{\text{Variability in Smaller Samples}}$:

   - The large difference for $\textcolor{red}{\text{k = 10}}$ highlights the randomness inherent in the reservoir sampling technique when only a few samples are taken. $\textcolor{red}{\text{With k = 10}}$, the AMS algorithm has fewer opportunities to track the distribution of frequencies accurately, leading to $\textcolor{red}{\text{more variability and less precision}}$.

   - With $\textcolor{green}{\text{larger k values like k = 50 and k = 100}}$, the sample is better able to capture the distribution of elements in the stream, leading to $\textcolor{green}{\text{more accurate estimates}}$.
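
Since one run per $k$ says little about variability, a rough way to examine it is to repeat the estimation many times on the same stream and average the relative error. The sketch below does this on a smaller synthetic stream so that it runs quickly; the function names, the stream parameters (1000 items, 32 distinct values), the seeds and the number of trials are all assumptions chosen for illustration, not values from the lab.

```python
import random
from collections import Counter

def ams_reservoir(stream, k, rng):
    """One AMS estimate of the second moment, sampling positions by reservoir."""
    v, c = [], []
    n = 0
    for s in stream:
        n += 1
        # Count the new occurrence in every slot that tracks this value
        for j in range(len(v)):
            if v[j] == s:
                c[j] += 1
        if len(v) < k:
            v.append(s)
            c.append(1)
        elif rng.random() < k / n:
            j = rng.randrange(k)
            v[j], c[j] = s, 1
    if not c:
        return 0.0
    return n * sum(2 * x - 1 for x in c) / len(c)

def mean_relative_error(stream, k, trials, seed=0):
    """Average |estimate - truth| / truth over repeated runs on one stream."""
    true_f2 = sum(m * m for m in Counter(stream).values())
    rng = random.Random(seed)
    errors = [abs(ams_reservoir(stream, k, rng) - true_f2) / true_f2
              for _ in range(trials)]
    return sum(errors) / trials

# Smaller stream than in the lab, to keep the repeated runs fast
data_rng = random.Random(42)
stream = [data_rng.randrange(32) for _ in range(1000)]

results = {k: mean_relative_error(stream, k, trials=40) for k in (10, 100)}
for k, err in results.items():
    print(f"k = {k:3d}: mean relative error over 40 runs = {err:.3f}")
```

Averaging over runs separates the effect of $k$ from the luck of a single draw, which a single estimate per $k$ cannot do.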

#### 4.3.3 $\underline{\text{Trade-off between Memory and Accuracy}}$:
   - $\fbox{Memory}$: The value of $k$ directly determines how much memory the AMS algorithm uses, as it needs to store $k$ sampled values and their counts. $\textcolor{orange}{\text{Larger k values provide better estimates but require more memory}}$.

   - $\fbox{Accuracy}$: As seen in the results, the $\textcolor{orange}{\text{estimate tends to become more accurate as k increases}}$. This illustrates the trade-off: using more memory (larger $k$) improves the accuracy of the estimate, $\textcolor{orange}{\text{but with diminishing returns beyond a certain point}}$ (in this run, going from $k = 50$ to $k = 100$ did not reduce the deviation any further).

#### 4.3.4 $\underline{\text{Asymptotic Behavior}}$:
   - In theory, the spread of the AMS estimate shrinks as $k$ grows, so the estimate concentrates around the true second moment. In this run, increasing $k$ from $50$ to $100$ did not bring the estimate closer (a deviation of 1.1% versus 1.7%), suggesting that beyond a certain threshold, $\textcolor{red}{\text{the gains in accuracy may not justify the additional memory cost}}$. A single run per $k$ cannot settle this; averaging over repeated runs would give a firmer picture.
