# Sampling probability analysis

During training, the training set is fed to the model by a custom data_wrangling.sample_generator() function. In short, it does the following:
1. Set the random seed (line 4)
2. Randomly draw n item indices (where n = batch_size) according to the sampling probability (data.sample_p) (lines 10-12)
3. Retrieve the actual data from data.x_train by idx (line 20)
4. Repeat it across n_timesteps (to clamp the input over timesteps) (lines 18-20)
5. Repeat steps 2-4 each time the generator is called

For simplicity, the version shown below omits the semantic input.

In [None]:
def sample_generator(cfg, data):
    # Dimension guide: (batch_size, timesteps, nodes)

    np.random.seed(cfg.rng_seed)
    epoch = 0
    batch = 0

    while True:
        batch += 1
        idx = np.random.choice(
            range(len(data.sample_p)), cfg.batch_size, p=data.sample_p
        )

        # Preallocate for easier indexing: (batch_size, time_step, input_dim)
        batch_s = np.zeros((cfg.batch_size, cfg.n_timesteps, cfg.output_dim))
        batch_y = []

        for t in range(cfg.n_timesteps):

            batch_y.append(data.y_train[idx])

        if batch % cfg.steps_per_epoch == 0:
            epoch += 1  # Counting epoch for ramping up input S

        yield (data.x_train[idx], batch_y)

In short, data.sample_p controls how often the model sees a given word during training.
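As a minimal sketch of that claim, the snippet below draws many batches with np.random.choice and compares how often each item was drawn against the probabilities passed in. The three-item sample_p, the seed, and the batch sizes are made-up values for illustration, not values from the training data.

```python
import numpy as np

np.random.seed(0)                        # hypothetical seed
sample_p = np.array([0.5, 0.3, 0.2])     # hypothetical sampling probabilities (sum to 1)
batch_size, n_batches = 100, 1000        # hypothetical sizes

counts = np.zeros(len(sample_p))
for _ in range(n_batches):
    # Same call pattern as in sample_generator()
    idx = np.random.choice(range(len(sample_p)), batch_size, p=sample_p)
    counts += np.bincount(idx, minlength=len(sample_p))

# Share of all draws that went to each item; should be close to sample_p
empirical_p = counts / counts.sum()
print(empirical_p)
```

With enough batches, the empirical proportions approach sample_p, so an item with twice the sampling probability is presented roughly twice as often.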

## sample_p
There are two implementations for calculating sample_p:
1. HS04
2. Jay

In the HS04 implementation:

> The frequency of each item was coded using a square-root compression of the Wall Street Journal (WSJ) corpus (Marcus, Santorini, & Marcinkiewicz, 1993) according to the formula
> $p_i=\frac{\sqrt{f_i}}{\sqrt{m}}$ ... (6)
> where $f_i$ is the WSJ frequency of the ith item and m is 30,000 (a reasonable cutoff frequency). Values over 1.0 were set to 1.0; those less than 0.05 were set to 0.05.
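To make Formula 6 concrete, here is a small worked example on a handful of made-up frequencies (the values in wsj_freq are hypothetical, chosen to hit both clipping bounds). It shows only the compression and clipping steps; turning the result into proportions that sum to 1 is a separate step.

```python
import numpy as np

wsj_freq = np.array([10, 75, 7500, 30000, 60000])  # hypothetical WSJ frequencies
m = 30000                                          # cutoff frequency from HS04

p = np.sqrt(wsj_freq) / np.sqrt(m)                 # Formula 6
p_clipped = np.clip(p, 0.05, 1.0)                  # floor at 0.05, ceiling at 1.0

for f, raw, clipped in zip(wsj_freq, p, p_clipped):
    print(f"f={f:>6}  raw={raw:.4f}  clipped={clipped:.4f}")
```

A frequency of 75 sits right at the 0.05 floor and a frequency of 30,000 sits right at the 1.0 ceiling, so everything below 75 or above 30,000 is treated identically.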


In Jay's implementation:
``` python
# generate frequency proportions (probably could be a separate function)
train[4] = [min(x, 10000) for x in train[3]]  # cap frequency at 10k
train[5] = np.sqrt(train[4])  # take the SQRT of the capped frequency
train[6] = train[5] / train[5].sum()  # generate proportion for sampling
```

In [None]:
class wf_manager():
    # Note: the probability must sum to 1 when passing it to np.random.choice()
    def __init__(self, wf):
        self.wf = np.array(wf)

    def to_p(self, x):
        """
        Calculate proportion p, i.e. Sum(p) = 1 in training set
        """
        return x / np.sum(x)
    
    def root_freq(self):
        return np.sqrt(self.wf)

    def samp_hs04(self):
        """
        HS04 sampling p implementation
        """
        root = np.sqrt(self.wf) / np.sqrt(30000)  # Formula 6 (HS04)
        clip = root.clip(0.05, 1.0)               # Values over 1.0 were set to 1.0; those less than 0.05 were set to 0.05.
        return self.to_p(clip)                    # generate proportion for sampling
    
    def samp_jay(self):
        """
        Jay's sampling p implementation
        Identical to script from Jay (support.py)
        """
        cap = self.wf.clip(0, 10000)              # cap frequency at 10k
        root = np.sqrt(cap)                       # take the SQRT of the capped frequency
        return self.to_p(root)                    # generate proportion for sampling
    
    def samp_jay_in_hs04_style(self):
        """
        Jay's sampling p in HS04 style
        """
        root = np.sqrt(self.wf) / np.sqrt(10000)
        clip = root.clip(0., 1.)
        return self.to_p(clip)

## Use the current pipeline to generate sampling_p

In [None]:
import pandas as pd
import numpy as np
df_train = pd.read_csv('../common/input/df_train.csv', index_col=0)
wf = wf_manager(df_train['wf'])

Minor check: samp_jay and samp_jay_in_hs04_style agree to within 1e-16.

In [None]:
# Compare absolute differences so that negative deviations are also caught
assert not np.any(np.abs(wf.samp_jay() - wf.samp_jay_in_hs04_style()) > 1e-16)
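The agreement is expected rather than coincidental. Assuming all frequencies are non-negative, capping at 10,000 before taking the root is the same as capping the root at 100, and dividing every item by the constant 100 cancels out in the normalization:

$$
p_i^{\text{jay}}
= \frac{\sqrt{\min(f_i, 10000)}}{\sum_j \sqrt{\min(f_j, 10000)}}
= \frac{\min(\sqrt{f_i}, 100)}{\sum_j \min(\sqrt{f_j}, 100)}
= \frac{\min\left(\sqrt{f_i}/\sqrt{10000},\, 1\right)}{\sum_j \min\left(\sqrt{f_j}/\sqrt{10000},\, 1\right)}
$$

The last expression is what samp_jay_in_hs04_style computes, so any remaining difference comes from floating-point rounding only.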

Create a data frame for plotting

In [None]:
import altair as alt
import pandas as pd 
alt.data_transformers.enable("default")
alt.data_transformers.disable_max_rows()

df = pd.DataFrame({'p_jay': wf.samp_jay(), 'p_hs04': wf.samp_hs04(), 'root_wf': wf.root_freq()})
df = df.melt(id_vars='root_wf', var_name='sample', value_name='p')

Take a peek at the data frame

In [None]:
df.sample(10)

## Distribution of root frequency

In [None]:
alt.Chart(df).mark_area().encode(
    alt.X('root_wf', bin=alt.Bin(step=10)),
    alt.Y('count()')
).interactive()

- Root frequency is still heavily skewed to the right

In [None]:
pchart = alt.Chart(df).mark_line().encode(x='root_wf', y='p', color='sample').interactive()
pchart

- This is the same graph I showed last time, but with the x-axis changed to root frequency for easier scaling
    - when x is approximately 0 to 7, p_hs04 > p_jay
    - when x is 7 to 125, p_jay > p_hs04
    - when x is above 125, p_hs04 > p_jay
- The cumulative probability should equal 1
    - It does, but this is hard to see because the x-axis is very skewed
    - proof:

In [None]:
print('sum of all p in jay: {} and hs04: {}'.format(sum(wf.samp_jay()), sum(wf.samp_hs04())))

- The area under the curve won't be equal, since items are not evenly spread along the x-axis (the low end is much denser)
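The crossover regions in the chart are consistent with where each scheme is flat. The sketch below only computes the clipping thresholds on the root-frequency axis from the constants in the two formulas; the variable names are made up for illustration. The exact crossover points also depend on the normalizing sums, and therefore on the data.

```python
import numpy as np

# HS04: sqrt(f) / sqrt(30000), clipped to [0.05, 1.0]
hs04_floor_root = 0.05 * np.sqrt(30000)    # below this root frequency, p_hs04 is flat
hs04_ceiling_root = np.sqrt(30000)         # above this root frequency, p_hs04 is flat

# Jay: frequency capped at 10,000 before taking the square root
jay_ceiling_root = np.sqrt(10000)          # above this root frequency, p_jay is flat

print(round(hs04_floor_root, 2), round(hs04_ceiling_root, 2), jay_ceiling_root)
```

Below a root frequency of about 8.66, p_hs04 stays at its floor while p_jay keeps shrinking towards 0, which fits p_hs04 being larger at the low end. Between 100 and about 173.2, p_jay is already capped while p_hs04 keeps rising, which fits p_hs04 overtaking p_jay again at the high end.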

## Strain data

In [None]:
df_strain = pd.read_csv('../common/input/df_strain.csv', index_col=0)

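Descriptive statistics of root frequency in the high-frequency (HF) condition. The square root is applied to the describe() output, so only the min, quartile, and max rows read as root frequencies.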

In [None]:
np.sqrt(df_strain.loc[df_strain.frequency=='HF','wf'].describe())

Descriptive statistics of root frequency in the low-frequency (LF) condition

In [None]:
np.sqrt(df_strain.loc[df_strain.frequency=='LF','wf'].describe())

Highlight the 25th to 75th percentile range of each condition on the chart for easier viewing

In [None]:
anno_df = [{
            "start": 41.5,
            "end": 104.7,
            "condition": "HF"
          },
          {
            "start": 9.2,
            "end": 22.1,
            "condition": "LF"
          }]

anno_df = pd.DataFrame(anno_df)

rect = alt.Chart(anno_df).mark_rect().encode(
    x='start',
    x2='end',
    color='condition:N',
    opacity=alt.value(0.8)
).interactive()

rect + pchart
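The start and end values above are typed in by hand from the descriptive statistics. As a hedged alternative, they could be derived from the data directly. The sketch below uses a small made-up frame (df_demo) in place of df_strain, assuming only the frequency and wf columns used earlier. Note that it takes quantiles of the root frequency, which can differ slightly from the root of the quantiles because of interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_strain with the two columns used above
df_demo = pd.DataFrame({
    "frequency": ["HF"] * 5 + ["LF"] * 5,
    "wf": [900, 1600, 2500, 3600, 4900, 25, 100, 225, 400, 625],
})

root_wf = np.sqrt(df_demo["wf"])

# 25th and 75th percentile of root frequency within each condition
quartiles = root_wf.groupby(df_demo["frequency"]).quantile([0.25, 0.75]).unstack()

# Same layout as the hand-written annotation frame: start, end, condition
anno_demo = (
    quartiles.rename(columns={0.25: "start", 0.75: "end"})
    .rename_axis("condition")
    .reset_index()
)
print(anno_demo)
```

The resulting frame can be passed to alt.Chart() in the same way as the hand-written one.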

- p_jay > p_hs04 for most items in the Strain data set
    - the discrepancy increases as word frequency increases (until it hits the clipping point of p_jay)

In [None]:
!jupyter nbconvert --output-dir=. --to html sampling_check.ipynb