# TEST - DATA RANDOMIZER
This notebook initialises functions that generate independent data. The other methods in this repo generate synthetic data from an already existing dataset. The functions here instead take user inputs, so they could easily be ported into an app, and many features could be added simply by using different prompts.

Types of data, their features, and how each can be randomized:

* Integer: min/max, distribution, specific probabilities (weighted random)
* Float: min/max, distribution, specific probabilities (weighted random)
* Categorical (types): specific probabilities (weighted random), pseudo-random (coin toss)
* Dates/timestamps: start date/end date, amount per day, amount at specific times
* Boolean: specific probabilities (weighted random), pseudo-random (coin toss)
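
The two selection strategies named in the list can be sketched with plain NumPy. This is only an illustration and does not use this repo's `NumberRandomizer`, whose internals are not shown here; the labels, weights and seed are assumptions made up for the example.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=42)

# Hypothetical category labels and their probabilities (must sum to 1).
labels = ["low", "medium", "high"]
weights = [0.1, 0.3, 0.6]

# Weighted random: each label is drawn with its own probability.
weighted = rng.choice(labels, size=1000, p=weights)

# Coin toss: every option is equally likely when no weights are given.
uniform = rng.choice([True, False], size=1000)

# Empirical share of each label in the weighted sample.
shares = {label: float(np.mean(weighted == label)) for label in labels}
print(shares)
```

With enough samples, the printed shares should sit close to the requested weights.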

## Imports & Configuration

In [1]:
# Imports
import warnings # Must be first

import pandas as pd
import numpy as np

from datetime import datetime
from CastConverter import *
from DataRandomizer import DatasetGenerator, DatetimeRandomizer, NumberRandomizer

# Configuration
field_name = 'numbers'
size = 3
dg = DatasetGenerator()
nr = NumberRandomizer()
dtr = DatetimeRandomizer()

## Array creation

### 1. Number

In [None]:
#params = { 'df': 1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.CHISQUARE,
#                            size=3,
#                            params=params)

#params = { 'scale': 1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.EXPONENTIAL,
#                            size=3,
#                            params=params)

#params = { 'shape': 1.99 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.GAMMA,
#                            size=3,
#                            params=params)

#params = { 'mu': 0, 'beta': 0.1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.GUMBEL,
#                            size=size,
#                            params=params)

#params = { 'loc': 0, 'scale': 1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.LAPLACE,
#                            size=size,
#                            params=params)

#params = { 'loc': 0, 'scale': 0.1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.LOGISTIC,
#                            size=size,
#                            params=params)

#params = { 'df': 3, 'nonc': 20 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.NONCENTRALCHISQUARE,
#                            size=size,
#                            params=params)

#params = { 'dfnum': 3, 'dfden': 20, 'nonc': 3 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.NONCENTRALF,
#                            size=size,
#                            params=params)

#params = { 'mean': 10, 'std': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.NORMAL,
#                            size=size,
#                            params=params)

#params = { 'shape': 3 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.PARETO,
#                            size=size,
#                            params=params)

#params = { 'lam': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.POISSON,
#                            size=size,
#                            params=params)

#params = { 'a': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.POWER,
#                            size=size,
#                            params=params)

#params = { 'scale': 1.1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.RAYLEIGH,
#                            size=size,
#                            params=params)

#params = { 'df': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.STDT,
#                            size=size,
#                            params=params)

#params = { 'left': -5, 'mode': 0, 'right': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.TRIANGULAR,
#                            size=size,
#                            params=params)

#params = { 'min': 0, 'max': 1 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.UNIFORM,
#                            size=size,
#                            params=params)

#params = { 'mu': 0, 'kappa': 4 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.VONMISES,
#                            size=size,
#                            params=params)

#params = { 'mean': 3, 'scale': 2 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.WALD,
#                            size=size,
#                            params=params)

#params = { 'shape': 5 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.WEIBULL,
#                            size=size,
#                            params=params)

#params = { 'a': 5, 'weights': [0.1, 0, 0.3, 0.6, 0] }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.WEIGHTED,
#                            size=size,
#                            params=params)

#params = { 'a': 4 }
#numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.ZIPF,
#                            size=size,
#                            params=params)

#numbers

### 2. Number - Complete example

In [3]:
# Call the randomizer and store the result as integers in a pandas DataFrame
params = { 'mean':10, 'std':5, 'min':0, 'max':20 }

numbers = nr.get_numbers(distribution=NumberRandomizer.Distribution.NORMAL,
                            size=size,
                            params=params)
# Convert floats to integers if required
int_array = convert_floats_to_ints(numbers)
# Insert a new column into a dataframe
dfNumbers = pd.DataFrame(data=int_array, columns=[field_name])

# Cleaning
del dfNumbers, int_array, numbers
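
The `params` above combine a normal distribution with `min`/`max` bounds. How `NumberRandomizer` enforces those bounds is not shown in this notebook. One simple way to get a similar effect in plain NumPy is to clip the samples, as sketched below. Clipping piles out-of-range draws onto the bounds, which is an assumption of this sketch and not necessarily what the library does.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed for reproducibility

mean, std, lower, upper = 10, 5, 0, 20

# Draw from a normal distribution, then force values into [lower, upper].
samples = rng.normal(loc=mean, scale=std, size=1000)
clipped = np.clip(samples, lower, upper)

# Round to the nearest integer and cast to an integer dtype.
int_samples = np.rint(clipped).astype(int)
print(int_samples[:5])
```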

### 3. Categorical

In [None]:
# Draw weighted random categories from the randomizer
params = { 'a': 5, 'weights': [0.1, 0.1, 0.3, 0.4, 0.1] }
categories = nr.get_numbers(distribution=NumberRandomizer.Distribution.WEIGHTED,
                            size=10,
                            params=params)

# Convert floats to integers if required
int_categories = convert_floats_to_ints(categories)
int_categories

### 4. Boolean

In [None]:
# Weighted choice between True and False (equal weights act as a coin toss)
params = { 'a': [True, False], 'weights': [0.5, 0.5] }
categories = nr.get_numbers(distribution=NumberRandomizer.Distribution.WEIGHTED,
                            size=10,
                            params=params)

# Convert floats to integers if required
int_categories = convert_floats_to_ints(categories)
int_categories

### 5. Timestamp

In [None]:
params = { 'start': '1/1/2024', 'end': '1/11/2024', 'periods': 4, 'freq': None }
#params = { 'start': '1/1/2024', 'end': None, 'periods': 4, 'freq': "D" }

# Both variants are called in turn; the second result overwrites the first
values = dtr.get_timestamps("tstamp", params)
values = dtr.get_timestamps_pd("tstamp", params)
values
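
The `start`, `end`, `periods` and `freq` keys match the arguments of `pandas.date_range`. Assuming `DatetimeRandomizer` forwards them in a similar way (this notebook does not confirm it), the two parameter sets above would behave roughly like this:

```python
import pandas as pd

# start + end + periods: timestamps evenly spaced between the two dates.
evenly_spaced = pd.date_range(start="1/1/2024", end="1/11/2024", periods=4)

# start + periods + freq: one timestamp per day, starting at `start`.
daily = pd.date_range(start="1/1/2024", periods=4, freq="D")

print(evenly_spaced)
print(daily)
```

`pandas.date_range` needs exactly three of the four arguments, which would explain why one of `end` or `freq` is `None` in each parameter set.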

## Dataframe Creation

### 1. "Manual" example

In [None]:
size = 100
params = { 'a': 5,
          'end': '1/11/2024',
          'freq': None,
          'mean': 10,
          'periods': size,
          'start': '1/1/2024',
          'std': 5,
          'weights': [0.1, 0.1, 0.3, 0.4, 0.1]
          }

# Timestamp
timestamps = dtr.get_timestamps_pd("when", params)

# Category
tmp_categories = nr.get_numbers(distribution=NumberRandomizer.Distribution.WEIGHTED,
                                size=size,
                                params=params)

int_categories = convert_floats_to_ints(tmp_categories)
categories = pd.DataFrame({'category':int_categories})

# Pricing
prices = nr.get_numbers(distribution=NumberRandomizer.Distribution.NORMAL,
                        size=size,
                        params=params)

# Concatenate all
data = pd.concat([timestamps], ignore_index=False, axis=1)
#data = pd.concat([timestamps, categories], ignore_index=True, axis=1)
#data = pd.concat([timestamps, int_categories, prices], ignore_index=True, axis=1)
data.head()


### 2. "Automated" example

In [3]:

# Timestamp - When
params_a = { 'field': 'when',
          'fieldtype': DatasetGenerator.FieldType.DATETIME,
          'end': '1/11/2024',
          'freq': None,
          'periods': size,
          'start': '1/1/2024',
          }

# Category - Risk
params_b = { 'field': 'risk',
          'fieldtype': DatasetGenerator.FieldType.NUMBER,
          'distribution': NumberRandomizer.Distribution.WEIGHTED,
          'a': 3,
          'weights': [0.1, 0.3, 0.6]
          }

# Number - Score
params_c = { 'field': 'score',
          'fieldtype': DatasetGenerator.FieldType.NUMBER,
          'distribution': NumberRandomizer.Distribution.NORMAL,
          'max': 20,
          'mean': 10,
          'min': 0,
          'std': 5
          }

arr_params = [params_a, params_b, params_c]

df = dg.get_dataframe(arr_params=arr_params, size=10)
df.head()


DatasetGenerator.FieldType.DATETIME
DatasetGenerator.FieldType.NUMBER
DatasetGenerator.FieldType.NUMBER


| | when | risk | score |
|---|---|---|---|
| 0 | 2024-01-01 | 2.0 | 16.137763 |
| 1 | 2024-01-06 | 2.0 | 11.1172 |
| 2 | 2024-01-11 | 2.0 | 7.715829 |
| 3 | NaT | 1.0 | 14.764872 |
| 4 | NaT | 1.0 | 10.264536 |
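
The `NaT` values in the `when` column are consistent with the timestamp field having fewer rows than the other fields: `periods` is set to `size`, which is 3 in the configuration cell, while `get_dataframe` is called with `size=10`. This is an inference from the output and has not been checked against the `DatasetGenerator` source. The sketch below reproduces the effect with plain pandas and made-up score values:

```python
import pandas as pd

# Three timestamps, as produced by periods=3 between the two dates.
when = pd.Series(
    pd.date_range(start="1/1/2024", end="1/11/2024", periods=3),
    name="when",
)

# Five values for another column (made-up numbers for illustration).
score = pd.Series([16.1, 11.1, 7.7, 14.8, 10.3], name="score")

# Column-wise concat aligns on the index; missing timestamps become NaT.
df = pd.concat([when, score], axis=1)
print(df)
```

If this is indeed the cause, setting `periods` to the same value as the `size` passed to `get_dataframe` would likely remove the gap.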


# Generating telemetric data

In [None]:
features = ['Timestamp','x1', 'y1', 'x2', 'y2', 'classes', 'scores']
ftype = ['Timestamp', 'Float', 'Float', 'Float', 'Float', 'Categorical', 'Float']
distribution = ['N/A', 'Normal', 'Normal', 'Normal', 'Normal', 'Random', 'Normal']

telemetric = DataGenerator(features, ftype, distribution)
telemetric.head()

In [None]:
def box_format(df):
    # Completely dependent on the data: joins the four box columns
    # into one string column formatted as "[x1, y1, x2, y2]".
    new = df.copy()
    # Build one "[x1, y1, x2, y2]" string per row
    boxes = ['[' + ', '.join(str(v) for v in row) + ']'
             for row in zip(df.x1.tolist(), df.y1.tolist(),
                            df.x2.tolist(), df.y2.tolist())]
    new.insert(1, 'sensors.video.boxes', boxes)
    # Drop the original coordinate columns
    new.drop(columns=['x1', 'y1', 'x2', 'y2'], inplace=True)
    return new

In [None]:
telemetric_boxed = box_format(telemetric)

In [None]:
telemetric_boxed

In [None]:
telemetric_json = telemetric_boxed.to_json(orient='values', date_format='iso')

In [None]:
# The context manager closes the file even if the write fails
with open('data/generated_telemetricdata.json', 'w') as file:
    file.write(telemetric_json)
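
Note that `orient='values'`, used in the export above, writes only the row values, so the column names are not part of the JSON. A small sketch with made-up data shows the difference from `orient='records'`:

```python
import json

import pandas as pd

# Tiny made-up frame standing in for the telemetric data.
df = pd.DataFrame({"classes": ["car", "person"], "scores": [0.9, 0.4]})

# 'values': a list of rows, column names are dropped.
as_values = json.loads(df.to_json(orient="values"))

# 'records': one object per row, keyed by column name.
as_records = json.loads(df.to_json(orient="records"))

print(as_values)
print(as_records)
```

Whether the column names are needed depends on the consumer of the file; `'values'` is more compact, `'records'` is self-describing.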

# Additional comments

As mentioned in the description, these functions could serve as the backend for many features within an app. Maxime's main desire for the customization aspect of data generation is a case where a user could ask for fewer items at certain points in the day than at others.

An inefficient but functional way of accomplishing this would be to have the user input their desired timeframes and message counts (the same way they did for the rest of the data), and then call the _dataGenerator_ function separately for each timeframe. For example, if they ask for 1000 messages an hour between 9am and 6pm but 100 an hour between 6pm and 9am, we can run the data generator for 9am to 6pm with the requested distribution, run it again for 6pm to 9am with a lower frequency, and merge the two datasets. For a longer time frame, such as one week with the same 9am to 6pm requirement, we could simply generate two datasets covering the whole week with different numbers of messages per hour, then select the 9am to 6pm slice from one and the 6pm to 9am slice from the other and merge them. These brute-force approaches would require exactly the same inputs from the user as more sophisticated algorithms but would likely take longer.
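
A minimal sketch of the two-rate idea, using only the standard library so it does not depend on this repo's generator. The function name `timestamps_for_day` and its parameters are hypothetical; it only produces timestamps, and the remaining columns would still come from the randomizers.

```python
import random
from datetime import datetime, timedelta


def timestamps_for_day(day, busy_per_hour, quiet_per_hour,
                       busy_start=9, busy_end=18, seed=None):
    """Return sorted random timestamps for one day with two hourly rates."""
    rng = random.Random(seed)
    stamps = []
    for hour in range(24):
        # Busy window is [busy_start, busy_end); every other hour is quiet.
        in_busy_window = busy_start <= hour < busy_end
        count = busy_per_hour if in_busy_window else quiet_per_hour
        hour_start = day + timedelta(hours=hour)
        # Spread the messages uniformly at random within the hour.
        for _ in range(count):
            offset = timedelta(microseconds=rng.randrange(3600 * 10**6))
            stamps.append(hour_start + offset)
    return sorted(stamps)


day = datetime(2024, 1, 1)
stamps = timestamps_for_day(day, busy_per_hour=1000, quiet_per_hour=100, seed=1)
busy = sum(1 for ts in stamps if 9 <= ts.hour < 18)
print(len(stamps), busy)  # 10500 9000
```

Generating per hour avoids building two full datasets and slicing them, at the cost of one loop iteration per hour.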

**Note:** a business-hours frequency specification can be included, which helps exclude weekends.
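
In pandas terms, this could look like the sketch below. It assumes the `freq` value ends up in something like `pandas.date_range`, which this notebook does not confirm; the dates are chosen so that the range spans a weekend.

```python
import pandas as pd

# 2024-01-05 is a Friday and 2024-01-09 is a Tuesday.
# Business days only: Saturday and Sunday are skipped.
weekdays = pd.date_range(start="2024-01-05", end="2024-01-09", freq="B")

# Hourly timestamps restricted to business hours
# (the default BusinessHour window is 09:00 to 17:00, Monday to Friday).
business_hours = pd.date_range(
    start="2024-01-05 09:00",
    end="2024-01-08 17:00",
    freq=pd.offsets.BusinessHour(),
)

print(weekdays)
print(business_hours)
```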

Another potential use case would be a never-ending message generator that adds a message at a fixed time interval. This could also be added with simple use of Python's built-in `time` module, but it needs to be investigated further.
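
One possible shape for such a generator is sketched below with a Python generator and `time.sleep`. The names `message_stream` and `make_message` and the payload fields are hypothetical; a real message would be built from the randomizers in this notebook.

```python
import itertools
import time
from datetime import datetime


def message_stream(interval_seconds, make_message):
    """Yield one message every `interval_seconds`, forever."""
    for sequence in itertools.count():
        yield make_message(sequence)
        # Wait before producing the next message.
        time.sleep(interval_seconds)


def make_message(sequence):
    # Hypothetical payload for illustration only.
    return {"id": sequence, "timestamp": datetime.now().isoformat()}


# Take only the first three messages so the example terminates.
stream = message_stream(interval_seconds=0.01, make_message=make_message)
first_three = list(itertools.islice(stream, 3))
print(first_three)
```

Because the stream is lazy, the caller decides when to stop, for example with `itertools.islice` as above or by breaking out of a `for` loop.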