# Description
This notebook defines functions for generating independent data, that is, data created from scratch. The other methods in this repo generate synthetic data from an already existing dataset. The functions here take their parameters from user input, so they could easily be ported into an app, and many features could be added simply by using different prompts.

Types of data, their features, and how values are randomized:

* Integer: min/max, distribution, specific probabilities (weighted random)
* Float: min/max, distribution, specific probabilities (weighted random)
* Categorical (types): specific probabilities (weighted random), pseudo random (coin toss)
* Dates/timestamps: start date/end date, amount per day, amount at specific times
* Boolean: specific probabilities (weighted random), pseudo random (coin toss)
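
As a rough illustration of the difference between "weighted random" and "coin toss" selection in the list above, here is a minimal sketch using numpy. The category names, probabilities and seed are made up for the example and are not taken from the notebook's functions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable

categories = ['table', 'chair', 'desk']  # hypothetical values

# Weighted random: each value gets an explicit probability (they must sum to 1)
weighted = rng.choice(categories, size=1000, p=[0.7, 0.2, 0.1])

# "Coin toss": no probabilities given, so every value is equally likely
coin_toss = rng.choice(categories, size=1000)

# Boolean feature with a 90% chance of True
flags = rng.choice([True, False], size=1000, p=[0.9, 0.1])
```

The functions below follow the same idea but collect the values and weights from user prompts instead of hard-coded lists.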

# Functions

## Imports

In [1]:
import pandas as pd
from datetime import datetime
import numpy as np
import warnings

## Distribution generator

This randomizer prompts for the parameters required by the selected distribution. It also accepts minimum and maximum values for the output and enforces them with a rejection-sampling step, similar to SDV's "reject_sampling" method: values outside the range are discarded and new ones are drawn until enough remain. Because numpy generates arrays very efficiently, this sampling step adds little extra time.
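
The rejection-sampling idea can be sketched without any prompts. This is only an illustration under assumptions: `sample_within_bounds`, its `draw` callback and the `max_rounds` guard are hypothetical names introduced here, not part of the notebook or of SDV.

```python
import numpy as np


def sample_within_bounds(draw, size, low=None, high=None, max_rounds=1000):
    """Call draw(n) repeatedly, keeping only values strictly inside (low, high)."""
    kept = np.array([])
    for _ in range(max_rounds):  # guard against bounds that reject everything
        batch = draw(size)
        if low is not None:
            batch = batch[batch > low]
        if high is not None:
            batch = batch[batch < high]
        kept = np.append(kept, batch)
        if len(kept) >= size:
            return kept[:size]  # trim to the requested size
    raise RuntimeError('Bounds reject almost every sample; widen them.')


rng = np.random.default_rng(0)
heights = sample_within_bounds(lambda n: rng.normal(170, 10, n),
                               size=1000, low=150, high=190)
```

Drawing a full batch of `size` values on every round usually means one or two rounds are enough, which is why the extra cost stays small.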

In [6]:
def Randomizer(object_name, distribution, size):
    new_data = np.array([])

    # Takes in min and max. If a bound is not required, or a weighted distribution is requested on
    # categorical data, the input can be 'None' and no sampling will be done for that bound
    MIN = input('Input the minimum value of the data of object \"' +
                object_name + '\", if not applicable type \'None\':')
    MAX = input('Input the maximum value of the data of object \"' +
                object_name + '\", if not applicable type \'None\':')
    # Keep the literal string 'None' as the "no bound" marker, otherwise convert to float
    MIN = MIN if MIN == 'None' else float(MIN)
    MAX = MAX if MAX == 'None' else float(MAX)

    # If min is not lower than max, the sampler always empties the array and loops without end
    if MIN != 'None' and MAX != 'None' and MIN >= MAX:
        warnings.warn('Minimum is not lower than maximum!')
        return None

    while len(new_data) < size: # Distribution generation loop
        if (distribution == 'Normal'):
            if len(new_data) < 1:
                mean = float(
                    input(
                        'Input mean value for Normal distribution of object \"'
                        + object_name + '\":'))
                std = float(
                    input(
                        'Input standard deviation value for Normal distribution of object \"'
                        + object_name + '\":'))
            # Plan on adding the possibility of passing a df column or a list from which to compute the mean/std
            new_data = np.append(new_data, np.random.normal(mean, std, size))

        elif (distribution == 'Uniform'):
            if len(new_data) < 1:
                Min = float(
                    input(
                        'Input a value for Uniform distribution of object \"' +
                        object_name + '\":'))
                Max = float(
                    input(
                        'Input b value for Uniform distribution of object \"' +
                        object_name + '\":'))
            new_data = np.append(new_data, np.random.uniform(Min, Max, size))

        elif (distribution == 'Gamma'):
            if len(new_data) < 1:
                a = float(
                    input('Input a value for Gamma distribution of object \"' +
                          object_name + '\":'))
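            # np.random.gamma is used because no `gamma` object is imported; a is the shape (scale defaults to 1)
            new_data = np.append(new_data, np.random.gamma(a, size=size))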

        elif (distribution == 'Exponential'):
            if len(new_data) < 1:
                a = float(
                    input('Input scale parameter (1 / rate) of object \"' + object_name +
                          '\":'))
            new_data = np.append(new_data, np.random.exponential(a, size))

        elif (distribution == 'Weighted Distribution'):
            if len(new_data) < 1:
                val = input('Input array of values of object \"' +
                            object_name + '\":').split(',')
                # A plain list has no .astype, so convert each weight to float explicitly
                weights = [
                    float(w) for w in input(
                        'Input array of weights for the values of object \"' +
                        object_name + '\":').split(',')
                ]
            new_data = np.append(new_data,
                                 np.random.choice(val, size=size, p=weights))

        elif (distribution == 'Weibull'):
            if len(new_data) < 1:
                a = float(
                    input(
                        'Input shape value for Weibull distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.weibull(a, size))

        elif (distribution == 'Poisson'):
            if len(new_data) < 1:
                lam = float(
                    input(
                        'Input expected number for Poisson distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.poisson(lam, size))

        elif (distribution == 'Zipf'):
            if len(new_data) < 1:
                a = float(
                    input(
                        'Input parameter value for zipf distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.zipf(a, size))

        elif (distribution == 'Wald'):
            if len(new_data) < 1:
                mean = float(
                    input(
                        'Input mean value for wald distribution of object \"' +
                        object_name + '\":'))
                scale = float(
                    input(
                        'Input scale value for wald distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.wald(mean, scale, size))

        elif (distribution == 'Vonmises'):
            if len(new_data) < 1:
                mu = float(
                    input(
                        'Input mu value for vonmises distribution of object \"'
                        + object_name + '\":'))
                kappa = float(
                    input(
                        'Input kappa value for vonmises distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.vonmises(mu, kappa, size))

        elif (distribution == 'Triangular'):
            if len(new_data) < 1:
                left = float(
                    input(
                        'Input left value for Triangular distribution of object \"'
                        + object_name + '\":'))
                mode = float(
                    input(
                        'Input mode value for Triangular distribution of object \"'
                        + object_name + '\":'))
                right = float(
                    input(
                        'Input right value for Triangular distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data,
                                 np.random.triangular(left, mode, right, size))

        elif (distribution == 'Standard T'):
            if len(new_data) < 1:
                df = float(
                    input(
                        'Input degrees of freedom value for standard_t distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.standard_t(df, size))

        elif (distribution == 'Rayleigh'):
            if len(new_data) < 1:
                scale = float(
                    input(
                        'Input scale value for rayleigh distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.rayleigh(scale, size))

        elif (distribution == 'Power'):
            if len(new_data) < 1:
                a = float(
                    input(
                        'Input parameter value for power distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.power(a, size))

        elif (distribution == 'Pareto'):
            if len(new_data) < 1:
                lam = float(
                    input(
                        'Input shape value for Pareto distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.pareto(lam, size))

        elif (distribution == 'Noncentral F'):
            if len(new_data) < 1:
                dfnum = float(
                    input(
                        'Input Numerator degrees of freedom value for Noncentral F distribution of object \"'
                        + object_name + '\":'))
                dfden = float(
                    input(
                        'Input Denominator degrees of freedom value for Noncentral F distribution of object \"'
                        + object_name + '\":'))
                nonc = float(
                    input(
                        'Input non-centrality value for Noncentral F distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(
                new_data, np.random.noncentral_f(dfnum, dfden, nonc, size))

        elif (distribution == 'Noncentral Chisquare'):
            if len(new_data) < 1:
                df = float(
                    input(
                        'Input degrees of freedom value for Noncentral Chisquare distribution of object \"'
                        + object_name + '\":'))
                nonc = float(
                    input(
                        'Input non-centrality value for Noncentral Chisquare distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(
                new_data, np.random.noncentral_chisquare(df, nonc, size))

        elif (distribution == 'Logistic'):
            if len(new_data) < 1:
                loc = float(
                    input(
                        'Input loc value for logistic distribution of object \"'
                        + object_name + '\":'))
                scale = float(
                    input(
                        'Input scale value for logistic distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data,
                                 np.random.logistic(loc, scale, size))

        elif (distribution == 'Laplace'):
            if len(new_data) < 1:
                loc = float(
                    input(
                        'Input loc value for Laplace distribution of object \"'
                        + object_name + '\":'))
                scale = float(
                    input(
                        'Input scale value for Laplace distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.laplace(loc, scale, size))

        elif (distribution == 'Gumbel'):
            if len(new_data) < 1:
                loc = float(
                    input(
                        'Input loc value for gumbel distribution of object \"'
                        + object_name + '\":'))
                scale = float(
                    input(
                        'Input scale value for gumbel distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.gumbel(loc, scale, size))

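        elif (distribution == 'Chisquare'):
            if len(new_data) < 1:
                df = float(
                    input(
                        'Input degrees of freedom value for Chisquare distribution of object \"'
                        + object_name + '\":'))
            new_data = np.append(new_data, np.random.chisquare(df, size))

        else:
            # Without this branch an unknown name would leave new_data empty and loop forever
            raise ValueError('Unknown distribution: ' + str(distribution))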

        if MAX != 'None': # Sampler removing values greater than max then smaller than min 
            new_data = new_data[new_data < MAX]
        if MIN != 'None':
            new_data = new_data[new_data > MIN]
    return new_data[:size] # selects requested size of values

## Feature generation by type

In [7]:
def IntegerGenerator(object_name, distribution, size):
    # Calls the randomizer outputting a pandas df of type int
    int_Array = Randomizer(object_name, distribution, size).astype(int)
    objectslice = pd.DataFrame(data=int_Array, columns=[object_name])
    return objectslice

In [8]:
def FloatGenerator(object_name, distribution, size):
    # Calls the randomizer outputting a pandas df of type float
    float_Array = Randomizer(object_name, distribution, size).astype(float)
    objectslice = pd.DataFrame(data=float_Array, columns=[object_name])
    return objectslice

In [9]:
def CategoricalData(object_name, randomization_type, size):
    # Calls the randomizer or randomizes without distribution outputting a pandas df for categorical data
    if randomization_type == 'Weighted Distribution':
        cat_arr = Randomizer(object_name, randomization_type, size)
        objectslice = pd.DataFrame(data=cat_arr, columns=[object_name])
    elif randomization_type == 'Random':
        val = input('Input array of values of object \"' + object_name +
                    '\":').split(',')
        new_data = np.random.choice(val, size=size)
        objectslice = pd.DataFrame(data=new_data, columns=[object_name])
    return objectslice

In [10]:
def TimestampGenerator(object_name,
                       start=None,
                       end=None,
                       periods=None,
                       frequency=None):
    # receives timestamp specifications and creates and outputs a pandas df of type timestamp
    objectslice = pd.DataFrame(pd.date_range(start=start,
                                             end=end,
                                             periods=periods,
                                             freq=frequency,
                                             name=object_name).to_pydatetime(),
                               columns=[object_name])
    return objectslice

In [11]:
def BooleanGenerator(object_name, randomization_type, size):
    # Calls the randomizer or randomizes without distribution outputting a pandas df for boolean data
    val = [True, False]
    if randomization_type == 'Weighted Distribution':
        cat_arr = Randomizer(object_name, randomization_type, size)
        objectslice = pd.DataFrame(data=cat_arr, columns=[object_name])
    elif randomization_type == 'Random':
        new_data = np.random.choice(val, size=size)
        objectslice = pd.DataFrame(data=new_data, columns=[object_name])
    return objectslice

# Dataframe Creation Quick Example

In [1]:
size = 1000
timestamps = TimestampGenerator('time', '12/10/2021', '04/11/2030',
                                periods = size)
items = CategoricalData('Item', 'Random', size)
values = FloatGenerator('Value', 'Normal', size)
ratings = IntegerGenerator('Rating', 'Uniform', size)

Example values to enter when prompted for `Item`: table, chair, desk, chair, computer, window, book, emblem

In [None]:
# Concatenate column-wise; ignore_index is left out so the feature names are kept as column labels
data = pd.concat([timestamps, items, values, ratings], axis=1)
data.head()

# Data creation function

In [4]:
# All arguments must be lists, in the desired column order
# feature_names must contain strings
# features_type entries have to be 'Timestamp', 'Categorical', 'Float' or 'Integer'
# distributions are the distributions available in Randomizer, plus 'Random'.
# A Timestamp feature takes its settings from input, but a placeholder entry is still required
# The size is an integer, taken from the Timestamp inputs or prompted for


def DataGenerator(feature_names, features_type, distributions):
    values_ = pd.DataFrame()
    size = 0  # initialised for case when timestamp is not called first.
    dispatcher = {
        'Timestamp': TimestampGenerator,
        'Categorical': CategoricalData,
        'Float': FloatGenerator,
        'Integer': IntegerGenerator
    }  # enables function to be called from input

    if len(feature_names) != len(features_type) or len(feature_names) != len(
            distributions):
        warnings.warn(
            'Not the same number of names, types and distributions!')
        return None

    for i in range(len(feature_names)):
        # Accepts input for timestamp to define the size of the df
        if features_type[i] == 'Timestamp':
            print(
                'Exactly 3 of the following inputs must be filled in, this will be responsible for the amount of messages created.'
            )
            # pd.date_range accepts exactly 3 of its 4 arguments (start, end, periods, freq), which fixes the size of the array
            start = input('Input start date for \"' + feature_names[i] +
                          '\", if not applicable type \'None\':')
            end = input('Input end date for \"' + feature_names[i] +
                        '\", if not applicable type \'None\':')
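            size = input('Input the number of messages for \"' +
                         feature_names[i] +
                         '\", if not applicable type \'None\':')
            # Keep the string 'None' when no count is given, otherwise convert to int
            size = size if size == 'None' else int(size)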
            frequency = input('Input the frequency of messages for \"' +
                              feature_names[i] +
                              '\", if not applicable type \'None\':')
            # For each case the dispatcher calls the timestamp function on the available variables and adds the new df column
            # created to the complete dataset, values_
            if start == 'None':
                values_ = pd.concat([
                    values_, dispatcher[features_type[i]](feature_names[i],
                                                          end=end,
                                                          periods=size,
                                                          frequency=frequency)
                ],
                                    ignore_index=True,
                                    axis=1)
            elif end == 'None':
                values_ = pd.concat([
                    values_, dispatcher[features_type[i]](feature_names[i],
                                                          start=start,
                                                          periods=size,
                                                          frequency=frequency)
                ],
                                    ignore_index=True,
                                    axis=1)
            elif frequency == 'None':
                values_ = pd.concat([
                    values_, dispatcher[features_type[i]](
                        feature_names[i], start=start, periods=size, end=end)
                ],
                                    ignore_index=True,
                                    axis=1)
            elif size == 'None':
                values_ = pd.concat([
                    values_, dispatcher[features_type[i]](feature_names[i],
                                                          start=start,
                                                          end=end,
                                                          frequency=frequency)
                ],
                                    ignore_index=True,
                                    axis=1)
                size = len(values_[i])
                
        # This is used to call all other functions but TimestampGenerator 
        else:
            if size == 0:
                size = int(input('Input size desired: '))
            values_ = pd.concat([
                values_, dispatcher[features_type[i]](feature_names[i],
                                                      distributions[i], size)
            ],
                                ignore_index=True,
                                axis=1)
    values_.columns = feature_names
    return values_

In [None]:
features = ['time','item', 'method', 'value', 'score', 'rating']
ftype = ['Timestamp', 'Categorical', 'Categorical', 'Integer', 'Float', 'Float']
distribution = ['N/A', 'Random', 'Random', 'Exponential', 'Normal', 'Uniform']

final_data = DataGenerator(features, ftype, distribution)
final_data.head()

Exactly 3 of the following inputs must be filled in, this will be responsible for the amount of messages created.


# Generating telemetric data

In [None]:
features = ['Timestamp','x1', 'y1', 'x2', 'y2', 'classes', 'scores']
ftype = ['Timestamp', 'Float', 'Float', 'Float', 'Float', 'Categorical', 'Float']
distribution = ['N/A', 'Normal', 'Normal', 'Normal', 'Normal', 'Random', 'Normal']

telemetric = DataGenerator(features, ftype, distribution)
telemetric.head()

In [None]:
def box_format(df):  # completely dependent on data, this just joins 4 columns into one with the string [a, b, c, d]
    new = df.copy()
    # Convert each column to a list once, then build one '[x1, y1, x2, y2]' string per row
    rows = zip(df.x1.tolist(), df.y1.tolist(), df.x2.tolist(), df.y2.tolist())
    temp = ['[' + ', '.join(str(v) for v in row) + ']' for row in rows]
    new.insert(1, 'sensors.video.boxes', temp)
    new.drop(columns=['x1', 'y1', 'x2', 'y2'], inplace=True)
    return new

In [None]:
telemetric_boxed = box_format(telemetric)

In [None]:
telemetric_boxed

In [None]:
telemetric_json = telemetric_boxed.to_json(orient='values', date_format='iso')

In [None]:
# The context manager closes the file even if the write fails
with open('data/generated_telemetricdata.json', 'w') as file:
    file.write(telemetric_json)

# Additional comments

As mentioned in the description, these functions could serve as the backend for many features within an app. Maxime's main desire for the customization aspect of data generation is illustrated by an example where a user asks for fewer items at certain points in the day than at others.

An inefficient but functional way of accomplishing this would be to have the user input their desired timeframes and amounts of messages (the same way they did for the rest of the data), and then call the `DataGenerator` function separately for each timeframe. For example, if they ask for 1000 messages an hour between 9am and 6pm but 100 an hour between 6pm and 9am, we can run the data generator for 9am to 6pm with the requested distributions, run it again for 6pm to 9am with a lower frequency, and merge the datasets. If the same 9am to 6pm requirement applied over a longer time frame, such as one week, we could simply generate two datasets covering that week with different amounts of messages per hour, select the 9am to 6pm slice from one and the 6pm to 9am slice from the other, and merge them. These brute-force approaches would require exactly the same inputs from the user as more sophisticated algorithms but would likely take longer.
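
The brute-force merge described above can be sketched for the timestamp column alone. This is a minimal illustration, assuming a hypothetical helper `mixed_rate_timestamps` that is not part of the notebook; it builds one dense and one sparse range over the same period, keeps the 9am to 6pm slice of the first and the 6pm to 9am slice of the second, and merges them.

```python
import pandas as pd


def mixed_rate_timestamps(start, end, day_step, night_step,
                          day_start=9, day_end=18):
    # Two full ranges over the same period, one per message rate
    dense = pd.date_range(start=start, end=end, freq=day_step)
    sparse = pd.date_range(start=start, end=end, freq=night_step)
    # Keep daytime stamps from the dense range, night-time stamps from the sparse one
    in_day = (dense.hour >= day_start) & (dense.hour < day_end)
    in_night = (sparse.hour < day_start) | (sparse.hour >= day_end)
    merged = dense[in_day].append(sparse[in_night]).sort_values()
    return pd.DataFrame({'time': merged})


# 1000 messages an hour is one every 3.6 s; 100 an hour is one every 36 s
stamps = mixed_rate_timestamps('2021-12-10', '2021-12-11',
                               pd.Timedelta(milliseconds=3600),
                               pd.Timedelta(seconds=36))
```

The other feature columns could then be generated once for `len(stamps)` rows and concatenated to this frame, rather than generated separately per timeframe.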

**Note:** the frequency specification can include business hours, which helps exclude weekends.
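
A small sketch of what the note refers to, using pandas' `BusinessHour` offset as the frequency. The start date and number of periods are arbitrary; how this would be wired into `TimestampGenerator` is left open.

```python
import pandas as pd

# Hourly stamps restricted to business hours (09:00 to 17:00 by default, Monday to Friday).
# 2021-12-10 is a Friday, so the range continues on the following Monday.
stamps = pd.date_range(start='2021-12-10 09:00',
                       periods=20,
                       freq=pd.offsets.BusinessHour())

print(stamps.dayofweek.unique())  # weekday numbers present in the range
```

Passing the offset object rather than a string alias avoids depending on which alias spelling the installed pandas version accepts.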

Another potential use case would be a never-ending message generator that adds a message at a fixed time interval. This could also be added using Python's built-in `time` module, but it needs to be investigated further.
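
One possible shape for such a generator, offered only as a starting point for that investigation. `message_stream` and its `make_message` callback are hypothetical names; the sketch uses a Python generator with `time.sleep` and stops after three messages via `itertools.islice` so that it terminates.

```python
import itertools
import time
from datetime import datetime


def message_stream(make_message, interval_seconds):
    # Yields one message forever, pausing between messages
    while True:
        yield {'time': datetime.now(), **make_message()}
        time.sleep(interval_seconds)


# Take only the first three messages so the example finishes
for message in itertools.islice(
        message_stream(lambda: {'value': 1.0}, interval_seconds=0.01), 3):
    print(message)
```

A blocking `sleep` is the simplest option; an app would more likely need a scheduler or an asynchronous loop, which is outside the scope of this sketch.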