GitHub - DAVIDCRUZ0202/pysimstudy: Python Port of the R package for synthetic data generation

Simulacra - Fake Data Synthesis

Goals

This project aims to help data scientists easily create fake datasets for algorithm testing, model validation, and general-purpose data generation (e.g. accelerators and education).

Introduction to simstudy

Before generating synthetic data with simstudy, you first need to understand the two layers involved in the process:

Data definition, in which the user specifies the distribution she wishes to draw from, as well as its parameters. A neat feature of simstudy is that users can easily specify relationships between inputs and outputs, either by explicitly defining a variable as a function of another variable or by passing a correlation matrix as an argument.
Data generation, in which the user calls a set of functions to generate the data based on the definitions provided in the previous step.
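
The two layers can be illustrated without the package at all. The following is a minimal sketch in plain NumPy and pandas, not the package's implementation: the `definitions` list, its keys, and the `generate` function are hypothetical names chosen for illustration. It shows a definition table in which one variable is a function of another, followed by a separate generation step.

```python
import numpy as np
import pandas as pd

# Layer 1: data definition. A plain list of dicts stands in for the
# definition table (hypothetical structure, not the package's own).
definitions = [
    {"varname": "x", "mean": 0.0, "sd": 1.0},
    # y depends on x: its mean is a function of already generated columns
    {"varname": "y", "mean": lambda df: 2.0 * df["x"], "sd": 0.5},
]

def generate(n, definitions, seed=0):
    """Layer 2: data generation. Draw each normal variable in definition order."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(index=range(n))
    for d in definitions:
        mean = d["mean"](df) if callable(d["mean"]) else d["mean"]
        df[d["varname"]] = rng.normal(loc=np.asarray(mean), scale=d["sd"], size=n)
    return df

data = generate(1000, definitions)
```

Keeping the definitions separate from the generation step means the same definition table can be reused to draw datasets of different sizes or with different seeds.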

For an in-depth discussion of simstudy, please refer to the R package's original documentation as well as Keith Goldfeld's excellent [posts](https://www.r-bloggers.com/author/keith-goldfeld/).

How to use the package - generating biased synthetic data

A simple demo in which we define bias as a decreased (increased) likelihood of witnessing a successful (unsuccessful) outcome due to being part of a certain group, holding all else equal.

Scenario 1 - An Unbiased Loan Approval Data Generation Process

In scenario 1 we simulate a simple loan approval process using simstudy. In the first step, the data definition step, we define a normally distributed income_standardized variable with a mean of 0 and a variance of 1. This variable represents an individual's income as a Z-score: instead of taking its raw value, we express income as the number of standard deviations from the mean.

We then simulate a loan approval process in which all individuals have a baseline 50% approval chance, shifted up or down by income_standardized / 10. In essence, every one-standard-deviation increase in income is associated with a 10 percentage point increase in the probability of receiving a loan. Individuals at the higher (lower) end of the income scale are more (less) likely to receive a loan.

df = defData(varname = "income_standardized", formula=0,
             variance=1, dist="normal")
 
# add a new data definition to the previously defined data definition table
# approval probability: 50% baseline plus income_standardized / 10
df = defData(df, varname="approval", formula='0.5+(income_standardized/10)', dist='binary')

# generate 10000 datapoints based on definitions
data = genData(10000, df)
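
As a sanity check on the process described above (a 50% baseline shifted by income_standardized / 10), the same data-generating process can be sketched directly in NumPy, independently of the package. This is an illustrative sketch under stated assumptions: the variable names are hypothetical, and clipping the probability to [0, 1] is our own choice, since how the package handles out-of-range probabilities is not covered here.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Z-scored income: mean 0, standard deviation 1
income_standardized = rng.normal(loc=0.0, scale=1.0, size=n)

# 50% baseline, plus 10 percentage points per standard deviation of income.
# Clipping to [0, 1] is an assumption made for this sketch.
p_approval = np.clip(0.5 + income_standardized / 10, 0.0, 1.0)
approval = rng.binomial(1, p_approval)

overall_rate = approval.mean()
high_income_rate = approval[income_standardized > 1].mean()
low_income_rate = approval[income_standardized < -1].mean()
print(overall_rate, high_income_rate, low_income_rate)
```

The overall approval rate should land close to 50%, with a higher rate among individuals more than one standard deviation above the mean than among those more than one standard deviation below it.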

Scenario 2 - A Biased Loan Approval Data Generation Process

Now we simulate a biased approval process. We first generate a categorical column, category, with 3 values (red, blue, green), each with the same probability (about 33%) of being drawn. We then use the defCondition function to create a biased approval process: when an individual's category is blue, her baseline approval probability is 40%, while all others have a 50% baseline. This is a form of direct bias, and it can be spotted quickly in summary statistics of the joint distribution of category and approval.

df2 = defData(varname = "category", formula="0.333, 0.333, 0.333",
             variance="red, blue, green", dist="categorical")

# add categorical data
data = addColumns(df2, data)

defC = defCondition(condition = "category=='blue'", formula = "0.4+income_standardized/10",
                    dist = "binary")

defC = defCondition(defC, condition = "category!='blue'", formula = "0.5+income_standardized/10",
                    dist = "binary")

# add a target column
data_biased = addCondition(defC, data, newvar="approval_bias")

data_biased.groupby('category')['approval_bias'].mean()
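
To show what the group-wise summary is expected to reveal, here is a minimal sketch of the same biased process in plain NumPy and pandas, independent of the package. The names `sim`, `rates`, and `gap` are hypothetical, and clipping probabilities to [0, 1] is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000

sim = pd.DataFrame({
    "income_standardized": rng.normal(0.0, 1.0, n),
    # three equally likely categories
    "category": rng.choice(["red", "blue", "green"], size=n),
})

# Direct bias: blue gets a 40% baseline, everyone else 50%
baseline = np.where(sim["category"] == "blue", 0.4, 0.5)
p = np.clip(baseline + sim["income_standardized"].to_numpy() / 10, 0.0, 1.0)
sim["approval_bias"] = rng.binomial(1, p)

# Approval rate per category, and the gap between blue and the others
rates = sim.groupby("category")["approval_bias"].mean()
gap = rates[["red", "green"]].mean() - rates["blue"]
print(rates)
print(gap)
```

Because income is distributed identically across categories, the gap in approval rates between blue and the other groups should sit near 10 percentage points, which is the bias built into the baseline.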

Installation

Use a personal access token to clone the repository

git clone https://github.ibm.com/dse-rnd-incubator/simulacra-fake-data

Create a working branch

git checkout -b branch-name

Before working, pull from master to update changes

git pull origin master

After working in your local branch, add, commit, and push your changes to your branch

git add (insert filenames here)
git commit -m "add comment message for commit"
git push origin branch-name

We have not yet published this package, so the only way to use it is to clone the repository from GitHub onto your local machine. This is an internal project for now, and any contribution is welcome.

Additional Information

Follow the PEP 8 style guide for Python coding best practices: https://www.python.org/dev/peps/pep-0008/

Roadmap

Porting simstudy codebase to Python
Creating UI
Integrating more functionalities not included in simstudy from other packages or our own code
