# Simulate Fake Data

This code was written for a different project, so it jumps through a few hoops whose only purpose is to make the cardinality highly skewed. Those details don't matter here, but note that the final data set consists of parquet files that vary wildly in size.

To run this you will need the following path in your working directory: `./output/df1/` 
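If the directory does not exist yet, it can be created from Python before running anything else. This is a minimal sketch; the variable name `out_dir` is only illustrative and is not used elsewhere in this notebook.

```python
from pathlib import Path

# Create ./output/df1/ (and ./output/) relative to the working directory.
# exist_ok=True makes this safe to re-run when the folder is already there.
out_dir = Path("./output/df1")
out_dir.mkdir(parents=True, exist_ok=True)
```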

In [None]:
import pandas as pd
import scipy.stats as st
import numpy as np

np.random.seed(42)

## Constants

These constants are used below and are described in the text.

In [None]:
# df1 constants:

bucket_n = 40_000
key_n    = 7_000
group1_n = 10
group2_n = 20
group3_n = 30
group4_n = 40

P_maj1 = (160, 125125)       # (loc, scale) of the majority exponential distribution
P_min1 = (1006402, 8200726)  # (loc, scale) of the minority exponential distribution


## `df1` - data with large record count

The first dataframe, `df1`, is the larger table in terms of record count. Its fields are as follows:

`key`: The key. The number of unique keys is low compared to the number of records in each key.

`bucket`: Bucket IDs cycle from 1 up to the constant `bucket_n` above within each key, so every key with at least `bucket_n` records has the same number of unique values in `bucket`.

`group1`: There are 4 grouping fields. Each row is assigned a random integer ID in each of them, and they are later used in a group by.

`group2`: see above

`group3`: see above

`group4`: see above

`value`: This is the value we will sum later. It is drawn uniformly at random from [0, 1).
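To make the schema concrete, here is a toy frame with the same columns and one possible group-by sum over it. The rows are hand-written for illustration, and the exact aggregation used later is not shown in this section, so treat the `groupby` call as an assumption about how the fields fit together.

```python
import pandas as pd

# Hand-written toy rows using the df1 column names (not simulated data).
toy = pd.DataFrame({
    "key":    [1, 1, 1, 2],
    "bucket": [1, 2, 3, 1],
    "group1": [1, 1, 2, 1],
    "group2": [5, 5, 5, 7],
    "group3": [3, 3, 3, 3],
    "group4": [9, 9, 9, 9],
    "value":  [0.5, 0.25, 1.0, 2.0],
})

# Sum `value` within each combination of key and the four group fields.
group_cols = ["key", "group1", "group2", "group3", "group4"]
summed = toy.groupby(group_cols, as_index=False)["value"].sum()
print(summed)
```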


## Simulate `df1`

Let's build `df1`:




In the original problematic data, the number of records per key is distributed like a mixture of two exponential distributions. 95% of the keys get their record count from a somewhat shorter-tailed distribution, and 5% get theirs from a much longer-tailed distribution.

With the exponential parameters set up above (`P_maj1` and `P_min1`), this gives us roughly half a million simulated records per key, spread across up to 40,000 buckets per key.
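The "roughly half a million" figure can be checked from the parameters alone, since a shifted exponential has mean `loc + scale`. The sketch below repeats the constants so it runs on its own; the names `mean_majority`, `mean_minority`, `expected_per_key`, and `expected_total` are introduced here for illustration only.

```python
import scipy.stats as st

# Constants repeated from the cell above so this block is self-contained.
key_n = 7000
P_maj1 = (160, 125125)       # (loc, scale)
P_min1 = (1006402, 8200726)  # (loc, scale)

# Mean of a shifted exponential is loc + scale.
mean_majority = st.expon.mean(loc=P_maj1[0], scale=P_maj1[1])
mean_minority = st.expon.mean(loc=P_min1[0], scale=P_min1[1])

# Mix the two means with the 95% / 5% weights.
expected_per_key = 0.95 * mean_majority + 0.05 * mean_minority
expected_total = key_n * expected_per_key

print(f"expected records per key: {expected_per_key:,.0f}")
print(f"expected records overall: {expected_total:,.0f}")
```

Under these parameters the expectation is about 580,000 records per key, or roughly 4 billion records across all 7,000 keys. Truncating each draw to an integer lowers this only slightly.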

95% first:


In [None]:

draws_per_key_majority = st.expon.rvs(*P_maj1, size=round(.95 * key_n)).astype(int)

Then the 5%:

In [None]:

draws_per_key_minority = st.expon.rvs(*P_min1, size=round(.05 * key_n)).astype(int)

Note that drawing one rounded fraction of the keys from one distribution and another rounded fraction from the other can produce an off-by-one error: we might have 10 keys but only end up simulating 9. The probability of this happening goes down as the number of keys goes up.
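One way to rule out the off-by-one entirely is to round only the majority count and take the minority count as the remainder. This is a sketch of that alternative, not what the cells above do; `split_key_counts` is a hypothetical helper introduced here.

```python
def split_key_counts(key_n, majority_share=0.95):
    """Hypothetical helper: split key_n into two counts that always sum to key_n."""
    majority_n = round(majority_share * key_n)
    # Take the remainder instead of rounding a second time,
    # so no key can be lost or duplicated by rounding.
    minority_n = key_n - majority_n
    return majority_n, minority_n


majority_n, minority_n = split_key_counts(7000)
print(majority_n, minority_n)
```

The two counts could then be passed as the `size` arguments of the two `st.expon.rvs` calls.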


In [None]:
draws_per_key = np.concatenate((draws_per_key_majority, draws_per_key_minority))

Total number of records that will be in `df1`:

In [None]:
sum(draws_per_key) 

So now we know how many keys there are and how many records each key gets. The simulation of `df1` loops over `draws_per_key` and draws that many observations with random groups for each key. This could all be vectorized, but I'm keeping it as a loop for readability.
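For comparison, here is a rough idea of what a vectorized version could look like, shown on tiny stand-in inputs. The names `small_draws`, `small_bucket_n`, and `df_small` are made up for this sketch, and it uses NumPy's `default_rng` generator rather than the seeded global state used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Tiny hypothetical stand-ins for draws_per_key and bucket_n.
small_draws = np.array([3, 5, 2])
small_bucket_n = 4

total = int(small_draws.sum())

# Repeat each key id as many times as that key has records.
keys = np.repeat(np.arange(1, len(small_draws) + 1), small_draws)

# Row position within its own key (0, 1, 2, ...), restarting at every key.
starts = np.cumsum(small_draws) - small_draws
position_in_key = np.arange(total) - np.repeat(starts, small_draws)

# Cycle bucket ids 1..small_bucket_n within each key.
buckets = position_in_key % small_bucket_n + 1

df_small = pd.DataFrame({
    "key": keys,
    "bucket": buckets,
    "group1": rng.integers(1, 11, size=total),  # ids 1..10 inclusive
    "value": rng.random(total),
})
print(df_small)
```

At full scale a single frame like this would hold every record at once, so writing one file per key in a loop also keeps memory use bounded.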

In [None]:
%%time
## 33 min on my MBP and generates ~52GB of parquet files

# Simulate one dataframe per key: every row carries the same key, bucket ids
# cycle through 1..bucket_n, and groups and values are assigned at random.

for key, draws in enumerate(draws_per_key, start=1):
    df = pd.DataFrame()

    df["key"] = np.full(draws, key)

    # repeat the bucket ids 1..bucket_n until `draws` rows are filled
    df["bucket"] = np.resize(np.arange(1, bucket_n + 1), draws)

    # randint's `high` is exclusive, so add 1 to get ids 1..groupN_n inclusive
    df["group1"] = np.random.randint(low=1, high=group1_n + 1, size=draws)
    df["group2"] = np.random.randint(low=1, high=group2_n + 1, size=draws)
    df["group3"] = np.random.randint(low=1, high=group3_n + 1, size=draws)
    df["group4"] = np.random.randint(low=1, high=group4_n + 1, size=draws)
    df["value"] = np.random.random(size=draws)

    # write each key to its own file rather than accumulating frames in memory
    df.to_parquet(f'./output/df1/key_{str(key).zfill(5)}.parquet')

# peek at the last key's dataframe
df.head()