#### Task: Find the average number of repeated queries per user
#### Data: Stream of $3125000$ tuples of the form (user, query)
#### Assumptions: We only have disk space for $10\%$ of the stream

In [1]:
import numpy as np
import random
import pandas as pd
from collections import Counter
from google.colab import drive

In [2]:
!pip install mmh3
import mmh3

Collecting mmh3
  Downloading mmh3-5.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Downloading mmh3-5.1.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (101 kB)
Installing collected packages: mmh3
Successfully installed mmh3-5.1.0


In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
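def naive(el, a, b):
  # Keep each record independently with probability a/b; el is ignored.
  return np.random.randint(b) < a


def prop_sampling(el, a, b):
  # Hash the user into one of b buckets and keep the record only if the
  # bucket is one of the first a. The hash is deterministic, so a given
  # user is either always kept or always dropped.
  return mmh3.hash(el) % b < a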

The **naive** function returns True for roughly a fraction $\frac{a}{b}$ of the stream elements.

The **prop_sampling** function also returns True for roughly a fraction $\frac{a}{b}$ of the stream elements, but it takes the user into account. It hashes each user into one of $b$ buckets and keeps only the records whose user lands in one of the first $a$ buckets.

In short, **naive** samples randomly by *record*: each user will have about $\frac{a}{b}$ of their queries sampled. **prop_sampling**, on the other hand, samples randomly by *user*: about $\frac{a}{b}$ of the users will have all of their queries sampled, and the remaining users will have none.
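
The difference between the two strategies can be checked on a small synthetic stream. The sketch below is illustrative only: the helper names (`bucket`, `sample_by_record`, `sample_by_user`) and the toy stream are hypothetical, and `hashlib.md5` stands in for `mmh3` so that the example needs nothing beyond the standard library.

```python
import hashlib
import random
from collections import Counter


def bucket(user, b):
    # Stable hash of the user id (hashlib.md5 is a stand-in for mmh3 here).
    digest = hashlib.md5(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % b


def sample_by_record(stream, a, b, rng):
    # Keep each record independently with probability a/b.
    return [rec for rec in stream if rng.randrange(b) < a]


def sample_by_user(stream, a, b):
    # Keep a record only if its user falls into one of the first a buckets.
    return [rec for rec in stream if bucket(rec[0], b) < a]


rng = random.Random(0)
# Toy stream: 200 hypothetical users with 50 queries each.
stream = [(f"user{u}", f"query{q}") for u in range(200) for q in range(50)]
rng.shuffle(stream)

by_record = sample_by_record(stream, 1, 10, rng)
by_user = sample_by_user(stream, 1, 10)

full_counts = Counter(user for user, _ in stream)
record_counts = Counter(user for user, _ in by_record)
user_counts = Counter(user for user, _ in by_user)

# Every user kept by the hash-based sampler keeps all of their records.
complete = all(user_counts[u] == full_counts[u] for u in user_counts)
print(len(by_record), len(by_user), complete)
```

In this toy setting the per-record sampler keeps only a fraction of each user's queries, while the per-user sampler keeps complete histories for the users it selects.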

In [23]:
sample_naive = []
sample = []

with open('drive/MyDrive/log.txt', 'r') as f:
  lines = f.readlines()  # kept in memory: the full stream is reused below

for line in lines:
  record = line.split()  # [ip, address]
  ip = record[0]
  if naive(ip, 1, 10):
    sample_naive.append(record)
  if prop_sampling(ip, 1, 10):
    sample.append(record)

In [9]:
sample[:5]

[['50.234.80.222', 'http://www.redips.coop'],
 ['133.98.10.112', 'http://www.becca.melbourne'],
 ['93.62.167.155', 'http://www.dragonfly.travel'],
 ['143.205.69.221', 'http://www.ayukawa.biz'],
 ['11.46.139.55', 'http://www.unremarkable.hamburg']]

In [10]:
sample_naive[:5]

[['121.190.79.2', 'http://www.viceless.int'],
 ['187.160.180.74', 'http://www.cerite.to'],
 ['247.92.181.99', 'http://www.contradiction.museum.post'],
 ['133.98.10.112', 'http://www.becca.melbourne'],
 ['12.53.210.174', 'http://www.bewitching.doha']]

In [11]:
len(sample) #around 10%

306250

In [12]:
len(sample_naive) #around 10%

311983

Up to this point, both samples look similar: each holds roughly $10\%$ of the stream.

In [13]:
def average_query(sample): # for each user, the mean number of times each of their distinct queries occurs
  C = Counter()
  user = []
  query_num = []

  for s in sample:
    C[tuple(s)] += 1

  for c, v in C.items():
    user.append(c[0])
    query_num.append(v)

  data = {'IP': user, 'query_num': query_num}
  df = pd.DataFrame(data)
  mean_query = df.groupby('IP')['query_num'].mean()
  return mean_query
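
To make the statistic concrete, here is a minimal pandas-free sketch of the same computation on a toy sample. The function name `average_repeats` and the toy data are hypothetical; the logic is intended to mirror `average_query`, not replace it.

```python
from collections import Counter, defaultdict


def average_repeats(sample):
    # Count how many times each (user, query) pair occurs.
    pair_counts = Counter((user, query) for user, query in sample)
    # Collect the per-query counts under their user.
    per_user = defaultdict(list)
    for (user, _query), count in pair_counts.items():
        per_user[user].append(count)
    # Average the counts over each user's distinct queries.
    return {user: sum(c) / len(c) for user, c in per_user.items()}


toy = [("u1", "a"), ("u1", "a"), ("u1", "b"), ("u2", "c")]
result = average_repeats(toy)
print(result)  # {'u1': 1.5, 'u2': 1.0}
```

User `u1` issued query `a` twice and query `b` once, so the average is $(2 + 1) / 2 = 1.5$; user `u2` issued a single query once, giving $1.0$.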

First, let's see what happens on the full stream (to see what we should expect from the samples)

In [None]:
full = [line.split() for line in lines]

In [22]:
average_query(full)

IP,query_num
0.121.29.95,1.253510
0.245.23.91,1.254013
0.44.201.13,1.253007
10.125.23.188,1.255524
10.220.49.64,1.253007
...,...
99.0.160.14,1.255524
99.225.132.208,1.254516
99.230.7.44,1.256029
99.24.82.194,1.253510


In [14]:
average_query(sample_naive)

IP,query_num
0.121.29.95,1.010714
0.245.23.91,1.022508
0.44.201.13,1.018315
10.125.23.188,1.009036
10.220.49.64,1.012780
...,...
99.0.160.14,1.023810
99.225.132.208,1.031359
99.230.7.44,1.030822
99.24.82.194,1.015924


In [15]:
average_query(sample)

IP,query_num
0.245.23.91,1.254013
104.107.150.164,1.256029
109.20.130.175,1.252003
109.88.115.225,1.255524
11.46.139.55,1.253007
...,...
93.62.167.155,1.254013
94.122.47.146,1.252505
96.164.2.242,1.254013
96.175.155.142,1.255524


The full stream contains $1000$ users, each with an average number of repeated queries around $1.25$.

The naive sample also contains $1000$ users, but their average number of repeated queries is around $1.01$, which is not the correct answer. Because each occurrence of a query is sampled independently, most repetitions are lost.

The proper sample contains $98$ users (around $10\%$ of $1000$), and for those users the average number of repeated queries is exactly the same as in the full stream. This is because a user will either have all of their queries sampled or none at all.
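
A short calculation makes the naive result plausible. It is a sketch under one assumption, which matches how **naive** works: each occurrence of a query is kept independently with probability $p = \frac{a}{b}$. If a user issued a query $k$ times in the full stream, the number of sampled copies $X$ is binomial with parameters $k$ and $p$. The query appears in the sample only when $X \ge 1$, so the count seen by `average_query` has expectation

$$E[X \mid X \ge 1] = \frac{kp}{1 - (1 - p)^k}.$$

For $p = 0.1$ this gives $1$ for $k = 1$, $\frac{0.2}{0.19} \approx 1.05$ for $k = 2$, and $\frac{0.3}{0.271} \approx 1.11$ for $k = 3$. Even queries that were repeated in the full stream are therefore mostly observed once, which pulls the per-user average toward $1$.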