# Generating dummy datasets

This notebook describes an approach to generating dummy data with categorical columns in specified proportions.

TODO
- add continuous distributions (e.g. property values, account values, limits, etc)
- add ability to simulate claims from dummy data

In [None]:
#set up environment
import numpy as np
import pandas as pd

## Basics

Below we go through a simple example. Let's say we want to add a categorical column, Gender, with two levels. We want males to make up 45% of the records and females the remaining 55%. We could describe this using a Python dictionary as follows:

In [None]:
gender = {'M': 0.45,
          'F': 0.55}

In [None]:
cat = pd.DataFrame({'Probs': gender}) # converting to a Pandas dataframe

print(cat)

In [None]:
num_cats = cat.shape[0] # number of categories
num_cats

In [None]:
num_records = 100

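NumPy has a useful function, `np.random.choice`, that is well suited to generating samples of levels for our category. When its first argument is an integer, it samples from the integers 0 to `num_cats - 1` with the probabilities given by `p`.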

In [None]:
cat_sims = np.random.choice(num_cats, num_records, p=cat.Probs) # outputs a numpy array
cat_sims = pd.DataFrame({'Sims': cat_sims}) # converts array to data frame
cat_sims.head(10) # show first 10 rows
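
As an aside, `np.random.choice` also accepts a list of labels as its first argument, so the labels can be sampled directly instead of integer positions. The following is a minimal sketch of that alternative, not the approach used in the rest of this notebook; the names `labels`, `probs`, and `direct_sims` are illustrative.

```python
import numpy as np
import pandas as pd

gender = {'M': 0.45, 'F': 0.55}

labels = list(gender.keys())    # category labels
probs = list(gender.values())   # matching proportions, in the same order

# sample the labels themselves rather than integer positions
direct_sims = pd.DataFrame({'Gender': np.random.choice(labels, size=100, p=probs)})
direct_sims.head(10)
```

The integer-based route shown in this notebook keeps the simulated codes and the label lookup as separate steps, which can be easier to inspect while learning.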

In [None]:
mapping = pd.DataFrame({'Mapping': cat.index.values}) # create a second dataframe of the labels of our category
mapping

In [None]:
# merge the sims and the labels to produce a column of labels in the proportions we are looking for
cat_sims_2 = pd.merge(cat_sims, mapping, right_index=True, left_on='Sims').sort_index()
cat_sims_2.head(10)
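
A quick sanity check is to compare the simulated proportions with the target ones. The sketch below rebuilds the example with a fixed seed and a larger sample (both are assumptions made for illustration) and uses `value_counts(normalize=True)` to get the observed shares. With only 100 records the observed shares can differ from the targets by several percentage points.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, only to make the illustration repeatable

gender = {'M': 0.45, 'F': 0.55}
cat = pd.DataFrame({'Probs': gender})

# simulate integer positions, then look up the label for each position
sims = np.random.choice(cat.shape[0], 10000, p=cat.Probs)
labels = pd.Series(cat.index.values[sims], name='Mapping')

# observed share of each label, side by side with the target
check = pd.DataFrame({'Target': cat.Probs,
                      'Observed': labels.value_counts(normalize=True)})
print(check)
```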

## Fuller Implementation

We can put the steps above into a function to make it easier to reuse (DRY principle).

In [None]:
def gen_categorical(data_frame, category, col_name, num_records=1000):
    """adds category as a new column to data_frame
    
    Parameters
    ----------
    data_frame: pd.DataFrame
        the input dataset
    category: dict
        definition of the categorical, of the form {category_label (str): proportion (float)}
        proportions should add up to 1
    col_name: str
        name for the column in the returned dataset
    num_records: int, optional
        defaults to 1000, only used if input data_frame is empty
        
    Returns
    -------
    pd.DataFrame
        input DataFrame with additional or overwritten columns
        
    Examples
    --------
    >>> df = gen_categorical(pd.DataFrame(), {'M': 0.45, 'F': 0.55}, 'Gender')
    
    TODO
    --------
    Add checks that probabilities add to 1, check types
    
    """
    
    # wise to copy data_frame to avoid unintended inplace operations
    data_frame = data_frame.copy()
    
    # use existing dataframe length if available, otherwise keep num_records
    if data_frame.shape[0] > 0:
        num_records = data_frame.shape[0]
    
    # turn category definition into DataFrame object
    cat = pd.DataFrame({'Probs': category}) 
    
    # generate the dummy data
    cat_sims = np.random.choice(cat.shape[0], num_records, p=cat.Probs) # simulate
    cat_sims = pd.DataFrame({'Sims': cat_sims})
    mapping = pd.DataFrame({'Mapping': cat.index.values})
    cat_sims_2 = pd.merge(cat_sims, mapping, right_index=True, left_on='Sims').sort_index()
    
    # assigns the dummy data to the data_frame
    data_frame[col_name] = cat_sims_2.Mapping
    
    return data_frame
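
The TODO in the docstring mentions checking that the probabilities add to 1 and checking types. One possible way to do this is sketched below. `validate_category` is a hypothetical helper, not part of the original function, and the tolerance `tol` is an assumed value.

```python
import math

def validate_category(category, tol=1e-8):
    """Raise an error if category is not a valid {label: proportion} dict."""
    if not isinstance(category, dict) or len(category) == 0:
        raise TypeError('category must be a non-empty dict')
    for label, prop in category.items():
        if not isinstance(prop, (int, float)) or prop < 0:
            raise ValueError(f'proportion for {label!r} must be a non-negative number')
    # allow for small floating point error when summing the proportions
    total = sum(category.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f'proportions sum to {total}, expected 1')

validate_category({'M': 0.45, 'F': 0.55})  # passes silently
```

A call to such a helper could sit at the top of `gen_categorical`, before any sampling is done, so that a badly specified category fails with a clear message.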

Below we create some categories, initialise a dataframe, and then loop through the categories, adding each one to the dataset.

In [None]:
# set up categories as a dictionary
categories = {'Gender': {'M': 0.45,
                         'F': 0.55},
              
              'Product': {'A': 0.4,
                          'B': 0.2,
                          'C': 0.35,
                          'D': 0.05}
             }

# initialise data frame
dummy_data = pd.DataFrame()

# run for loop to iterate over the categories calling the gen_categorical function to generate the data
for cat_name, cat_value in categories.items():
    dummy_data = gen_categorical(dummy_data, cat_value, cat_name, num_records=5000)

# first few rows to check
dummy_data.head(20)
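
Because the draws are random, each run of the loop produces a different dataset. If a repeatable dataset is needed, one option is to set NumPy's global seed before generating. The sketch below only illustrates the effect of seeding on `np.random.choice`; the seed value 42 is arbitrary.

```python
import numpy as np

probs = [0.4, 0.2, 0.35, 0.05]

np.random.seed(42)  # arbitrary seed chosen for illustration
first = np.random.choice(4, 10, p=probs)

np.random.seed(42)  # resetting the same seed replays the same draws
second = np.random.choice(4, 10, p=probs)

print((first == second).all())
```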

In [None]:
path = 'C:\\Users\\U006256\\Desktop\\dummy_data.xlsx' # update to where you want to save the file
dummy_data.to_excel(path)