# Data Generation Notebook

This notebook generates random data in a reproducible and configurable way. To run it, make sure the dependencies in requirements.txt are installed.

If there are any questions, please email erik.hakansson{at}gmail.com

In [None]:
# General Imports
import os
import random
import numpy as np
import pandas as pd

In [52]:
SEED = 42

# Generator
Allows easy and reproducible data generation.

In [53]:
def generate_dataset(seed: int, n_continous: int, n_categorical: int, sample_size: int) -> pd.DataFrame:
    """
    Helper function for generating a random but reproducible df with the specified
    number of continuous and categorical variables.
    
    seed: number for reproducing random generations
    n_continous: number of continuous cols wanted
    n_categorical: number of categorical cols wanted (categories are the 26 lowercase English letters)
    sample_size: number of rows in the generated df
    """
    np.random.seed(seed)
    
    # generate the continuous-valued part of the df (uniform on [0, 1))
    continuous_data = np.random.rand(sample_size, n_continous)
    
    # build the lowercase English letters and sample them for the categorical vars
    categories = list(map(chr, range(97, 123)))
    categorical_data = np.random.choice(categories, size=(sample_size, n_categorical))

    # combine column-wise as DataFrames; np.concatenate on the raw arrays would
    # cast the floats to strings. ignore_index=True relabels the columns 0..n-1.
    return pd.concat(
        [pd.DataFrame(continuous_data), pd.DataFrame(categorical_data)],
        axis=1,
        ignore_index=True,
    )
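
A quick way to check the reproducibility claim is to generate data twice with the same seed and compare the results. The sketch below is self-contained and uses a hypothetical helper, `make_frame`, that mirrors the idea of `generate_dataset` on a smaller scale. It draws from a local `np.random.default_rng` generator rather than seeding the global state; that swap is an assumption made for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd


def make_frame(seed: int, n_rows: int = 5) -> pd.DataFrame:
    """Hypothetical stand-in for generate_dataset: one continuous and one categorical column."""
    rng = np.random.default_rng(seed)  # local generator, leaves the global NumPy state untouched
    letters = [chr(code) for code in range(97, 123)]  # 'a' .. 'z'
    return pd.DataFrame({
        "continuous": rng.random(n_rows),
        "categorical": rng.choice(letters, size=n_rows),
    })


first = make_frame(seed=42)
second = make_frame(seed=42)
other = make_frame(seed=7)

same_seed_matches = first.equals(second)      # same seed: frames should be identical
different_seed_matches = first.equals(other)  # different seed: values should differ
print(same_seed_matches, different_seed_matches)
```

The same comparison can be run against `generate_dataset` itself by calling it twice with the same `seed` and checking `DataFrame.equals`.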

# Generate DF according to instructions

Let's generate our data and make sure the columns are clearly labeled. Finally, we will export it to a CSV file, which allows a more modular (and realistic) way of stepping through the data science process.

In [54]:
# generate df and rename columns for more clarity
df = generate_dataset(seed=SEED, n_continous=3, n_categorical=1, sample_size=10000)

# make feature names explicit, and denote one continuous col as the prediction target
feature_names = ['feature' + str(col) for col in df.columns]
feature_names[0] = 'target'
df.columns = feature_names

In [55]:
df.head(5)

| | target | feature1 | feature2 | feature3 |
|---|---|---|---|---|
| 0 | 0.3745401188473625 | 0.9507143064099162 | 0.7319939418114051 | m |
| 1 | 0.5986584841970366 | 0.1560186404424365 | 0.1559945203362026 | f |
| 2 | 0.0580836121681994 | 0.8661761457749352 | 0.6011150117432088 | a |
| 3 | 0.7080725777960455 | 0.0205844942958024 | 0.9699098521619944 | f |
| 4 | 0.8324426408004217 | 0.2123391106782761 | 0.1818249672071006 | m |


In [60]:
# export to csv for a more realistic experience.
DATA_DIR = "../data"
file_name = f"testData-{SEED}-raw.csv"
relative_path = os.path.join(DATA_DIR, file_name)

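# create the data directory if it does not exist yet, then write the file
os.makedirs(DATA_DIR, exist_ok=True)
df.to_csv(relative_path)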
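
Because `DataFrame.to_csv` writes the row index as an unnamed first column by default, reading the file back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. The sketch below shows the round trip on a small made-up frame written to a temporary directory; the frame and file name are placeholders, not this notebook's data. It uses `index_col=0` to restore the index.

```python
import os
import tempfile

import pandas as pd

# Small made-up frame standing in for the generated data.
toy = pd.DataFrame({"target": [0.1, 0.2], "feature1": [0.3, 0.4], "feature3": ["m", "f"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-raw.csv")  # placeholder file name
    toy.to_csv(path)  # the index is written as an unnamed first column

    naive = pd.read_csv(path)                  # index comes back as 'Unnamed: 0'
    restored = pd.read_csv(path, index_col=0)  # first column is used as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the index carries no information. Whether the follow-up notebook expects the index column is not stated here, so the export above is left as it is.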

# We've only just begun...
Now that we've generated data, the only step between us and developing a model is a little bit of preprocessing. Note that we're skipping an EDA step, since we are dealing with random data after all...

Please head over to <code>model.ipynb</code> for the continuation :)

##### NOTE: if you're looking for signs of feature engineering, we'll get there in the model notebook!