# Calculate mean and std for entire dataset
When scaling the input data during preprocessing, it is common practice to normalize with a fixed mean and std computed over the whole dataset rather than per file. Here we calculate the mean and std for each detector across the entire dataset using [Welford’s algorithm](https://pypi.org/project/welford/).
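
For readers unfamiliar with it, here is a minimal sketch of the single-sample Welford update in plain NumPy. It is an illustration only, not the code of the `welford` package; the names `welford_update` and `welford_finalize` and the random test data are assumptions made for this example.

```python
import numpy as np

def welford_update(count, mean, m2, x):
    """Fold one new sample x into the running (count, mean, m2) state."""
    count += 1
    delta = x - mean              # distance to the old mean
    mean += delta / count         # updated running mean
    m2 += delta * (x - mean)      # running sum of squared deviations
    return count, mean, m2

def welford_finalize(count, mean, m2):
    """Return the mean and the population variance from the running state."""
    return mean, m2 / count

# Synthetic data, used only to compare against NumPy's own results.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=1000)

state = (0, 0.0, 0.0)
for x in samples:
    state = welford_update(*state, x)
mean, var = welford_finalize(*state)

print(mean, samples.mean())
print(var, samples.var())
```

The appeal of this update is that it needs only one pass and three numbers of state, so the full dataset never has to be held in memory.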

In [None]:
!pip install welford --user

In [None]:
import numpy as np
import pandas as pd
from welford import Welford
import glob
import json
from tqdm import tqdm
import tensorflow as tf
import tensorflow_datasets as tfds

# Train dataset
We build a TensorFlow dataset here, not because we need to but for practice.

In [None]:
train_files = glob.glob('../input/g2net-gravitational-wave-detection/train/*/*/*/*.npy')

In [None]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 1

def _parse_function1(filename):
    np_data = tf.io.read_file(filename)
    np_data = tf.strings.substr(np_data, 128, 98304) # header is 128 bytes (skip)
    np_data = tf.reshape(tf.io.decode_raw(np_data, tf.float64), (3, 4096))
    return np_data

train_ds = tf.data.Dataset.from_tensor_slices(train_files)
train_ds = train_ds.map(_parse_function1, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.batch(BATCH_SIZE)
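
The parse function above skips a fixed 128-byte header and then reads 3 × 4096 × 8 = 98304 bytes of float64 data. As a sanity check, the data offset of a `.npy` file can be read from the file itself. The sketch below follows the published `.npy` layout (magic string, version, header length) and uses an in-memory array as a stand-in for a competition file; the helper name `npy_data_offset` is an assumption for this example.

```python
import io
import struct
import numpy as np

def npy_data_offset(raw):
    """Return the byte offset at which the array data starts in a .npy file."""
    if raw[:6] != b'\x93NUMPY':
        raise ValueError('not a .npy file')
    major = raw[6]
    if major == 1:
        # Format 1.0 stores the header length as a little-endian uint16.
        (header_len,) = struct.unpack('<H', raw[8:10])
        return 10 + header_len
    # Formats 2.0 and 3.0 store it as a little-endian uint32.
    (header_len,) = struct.unpack('<I', raw[8:12])
    return 12 + header_len

# Stand-in for one file: a float64 array with the same shape as the real data.
example = np.random.default_rng(0).normal(size=(3, 4096)).astype('<f8')
buf = io.BytesIO()
np.save(buf, example)
raw = buf.getvalue()

offset = npy_data_offset(raw)
payload = len(raw) - offset
decoded = np.frombuffer(raw, dtype='<f8', offset=offset).reshape(3, 4096)

print('data starts at byte', offset)
print('payload bytes:', payload)
```

Running the same helper on the bytes of an actual training file would confirm whether the hard-coded offset of 128 holds for this dataset.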

The Welford library is quite slow on this amount of data, so it is much faster to use a TF implementation here. As a quality check of the TF version, we first calculate mean and variance for a subset with both the slow and the fast implementation.

In [None]:
w0 = Welford()
w1 = Welford()
w2 = Welford()

CHK_COUNT = 25000

for i in tqdm(range(CHK_COUNT)):
    d = np.load(train_files[i])
    w0.add_all(np.expand_dims(d[0], axis = 1))
    w1.add_all(np.expand_dims(d[1], axis = 1))
    w2.add_all(np.expand_dims(d[2], axis = 1))

Then the TF implementation:

In [None]:
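def tf_welford(ds, cnt_limit=-1):
    """Running per-detector mean and variance over a dataset of (1, 3, 4096) batches.

    Stops after cnt_limit files; the default of -1 processes the whole dataset.
    """
    ds_numpy = tfds.as_numpy(ds)
    w_mean = np.zeros(3, dtype=np.float64)
    w_var = np.zeros(3, dtype=np.float64)
    sumsq = np.zeros(3, dtype=np.float64)
    cnt = 0.0
    for da in tqdm(ds_numpy):
        cnt += 1.0
        for j in range(3):
            x = da[0,j]  # the 4096 samples of detector j (BATCH_SIZE is 1)
            # Welford-style running mean, updated once per file
            delta = tf.math.reduce_mean(x - w_mean[j]).numpy()
            w_mean[j] += delta / cnt
            # variance calculation deviates a little from Welford as it uses a batch of 4096:
            # accumulate the sum of squares, then take E[x^2] - mean^2
            sumsq[j] += tf.math.reduce_sum(tf.math.multiply(x, x)).numpy()
            w_var[j] = (sumsq[j]/(cnt*4096.)) - w_mean[j]*w_mean[j]

        if cnt == float(cnt_limit):
            break
    return w_mean, w_var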
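
The function above gets its variance from a running sum of squares, which is simple but can lose precision when the mean is large relative to the spread. One alternative, shown here only as a sketch and not what this notebook uses, is the batch-merge form of Welford's update (often attributed to Chan et al.), which folds in a whole block of samples at once. The name `merge_batch` and the synthetic data are assumptions for this example.

```python
import numpy as np

def merge_batch(count, mean, m2, batch):
    """Merge a batch of samples into the running (count, mean, m2) state."""
    batch = np.asarray(batch, dtype=np.float64)
    n_b = batch.size
    mean_b = batch.mean()
    m2_b = np.sum((batch - mean_b) ** 2)   # squared deviations within the batch

    total = count + n_b
    delta = mean_b - mean
    new_mean = mean + delta * n_b / total
    # Cross term accounts for the shift between the old mean and the batch mean.
    new_m2 = m2 + m2_b + delta ** 2 * count * n_b / total
    return total, new_mean, new_m2

# Synthetic stand-in: 10 "files" of 4096 samples each, with a large offset.
rng = np.random.default_rng(0)
data = rng.normal(loc=1000.0, scale=0.5, size=(10, 4096))

count, mean, m2 = 0, 0.0, 0.0
for row in data:
    count, mean, m2 = merge_batch(count, mean, m2, row)
variance = m2 / count   # population variance

print(mean, data.mean())
print(variance, data.var())
```

For this dataset the simpler sum-of-squares approach was checked against the reference implementation, so this variant is an option rather than a necessary change.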

In [None]:
w1_mean, w1_var = tf_welford(train_ds, CHK_COUNT)

Let's compare the values computed by the two implementations:

In [None]:
stats = [[w0.mean[0], w1_mean[0], w0.mean[0]/w1_mean[0], w0.var_s[0], w1_var[0], w0.var_s[0]/w1_var[0]],
         [w1.mean[0], w1_mean[1], w1.mean[0]/w1_mean[1], w1.var_s[0], w1_var[1], w1.var_s[0]/w1_var[1]],
         [w2.mean[0], w1_mean[2], w2.mean[0]/w1_mean[2], w2.var_s[0], w1_var[2], w2.var_s[0]/w1_var[2]]]
dfs = pd.DataFrame(stats, columns=['PyPI mean', 'TF mean', 'PyPI/TF mean ratio', 'PyPI var', 'TF var', 'PyPI/TF var ratio'])
dfs
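
One detail when reading the variance ratio: `var_s` from the `welford` package is the sample variance (divides by n − 1), while the TF function divides by n. The two therefore differ by a factor of n / (n − 1), which is negligible at this sample count. A small sketch with synthetic data, using NumPy's `ddof` argument to produce both variants:

```python
import numpy as np

# Synthetic data standing in for one detector's samples.
rng = np.random.default_rng(0)
n = 10 * 4096
x = rng.normal(size=n)

var_sample = np.var(x, ddof=1)   # divides by n - 1
var_population = np.var(x)       # divides by n

ratio = var_sample / var_population
expected = n / (n - 1)

print(ratio, expected)
```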

Yes, happy with that - good enough for this purpose! Now run through the entire train dataset:

In [None]:
w1_mean, w1_var = tf_welford(train_ds)

Store the mean and std in a JSON file for use in other notebooks:

In [None]:
train_stats = {}
train_stats['detector'] = []
for i in range(3):
    train_stats['detector'].append({
                'idx': i,
                'mean': w1_mean[i],
                'std': np.sqrt(w1_var[i])})
    
with open('train_stats.json', 'w') as fp:
    json.dump(train_stats, fp, indent=4)
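
In another notebook, the stored statistics could be loaded and applied roughly as follows. This is a sketch under the assumption that the file has the structure written above (a `detector` list with `idx`, `mean` and `std` entries); the helper names `load_detector_stats` and `normalize` are made up for this example.

```python
import json
import numpy as np

def load_detector_stats(path):
    """Read per-detector mean and std arrays from a stats JSON file."""
    with open(path) as fp:
        stats = json.load(fp)
    # Sort by detector index so position j in the arrays matches detector j.
    detectors = sorted(stats['detector'], key=lambda d: d['idx'])
    mean = np.array([d['mean'] for d in detectors], dtype=np.float64)
    std = np.array([d['std'] for d in detectors], dtype=np.float64)
    return mean, std

def normalize(x, mean, std):
    """Scale a (3, n_samples) array with one mean and std per detector."""
    return (x - mean[:, None]) / std[:, None]
```

Typical use would be `mean, std = load_detector_stats('train_stats.json')` followed by `normalize(np.load(some_file), mean, std)`, where `some_file` is the path to one of the `.npy` files.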

# Test dataset
Repeat for the test dataset.

In [None]:
test_files = glob.glob('../input/g2net-gravitational-wave-detection/test/*/*/*/*.npy')

In [None]:
test_ds = tf.data.Dataset.from_tensor_slices(test_files)
test_ds = test_ds.map(_parse_function1, num_parallel_calls=AUTOTUNE)
test_ds = test_ds.batch(BATCH_SIZE)

In [None]:
w2_mean, w2_var = tf_welford(test_ds)

Store the results in a JSON file as well:

In [None]:
test_stats = {}
test_stats['detector'] = []
for i in range(3):
    test_stats['detector'].append({
                'idx': i,
                'mean': w2_mean[i],
                'std': np.sqrt(w2_var[i])})
    
with open('test_stats.json', 'w') as fp:
    json.dump(test_stats, fp, indent=4)

# Summary
So is there a difference between the train and test datasets when it comes to mean and variance?

In [None]:
stats = []
for j in range(3):
    stats.append([w1_mean[j], w2_mean[j], w1_mean[j]/w2_mean[j], w1_var[j], w2_var[j], w1_var[j]/w2_var[j]])
dfs = pd.DataFrame(stats, columns=['Train mean', 'Test mean', 'Train/test mean ratio', 'Train var', 'Test var', 'Train/test var ratio'])
dfs

Yes - the mean values differ quite a bit between train and test, while the variances are very similar.