## Mirror Data Generator
- mirrorGen is an open-source tool that generates synthetic data from a correlation DAG, which describes the relations among the columns of the data. It can produce "dirty" data that mirrors various biases found in real-world data, for use in applications such as classification and ranking tasks [[1]](https://arxiv.org/abs/2006.08688).

## Demo: using mirrorGen to generate the data described by the DAG below
- It simulates a dataset with 6 columns:
    - G with values of 'M' and 'F'.
    - age with values in [20, 70].
    - job with values of 'E' and 'F'.
    - X1 with values in [-2, 2].
    - X2 with values from a standard Gaussian distribution N(0, 1).
    - D with values of 'Y' and 'N'.
- The correlations among the columns above are:
    - G affects age, job, and X1.
    - D is determined equally by age, job, X1, and X2: all four incoming edges have the same weight.

![DAG](dag_hiring.png "DAG") 
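
The dependencies listed above imply an order in which the columns can be sampled: every parent has to be generated before its children. The sketch below is not part of mirrorGen. It encodes the DAG as a plain dictionary (the `parents` mapping and the `topological_order` helper are illustrative names) and derives one valid sampling order with Kahn's algorithm.

```python
from collections import deque

# Parents of each column, taken from the description of the DAG above.
parents = {
    "G": [],
    "age": ["G"],
    "job": ["G"],
    "X1": ["G"],
    "X2": [],
    "D": ["age", "job", "X1", "X2"],
}


def topological_order(parents):
    """Return the columns so that every parent precedes its children."""
    remaining = {node: len(ps) for node, ps in parents.items()}
    children = {node: [] for node in parents}
    for node, ps in parents.items():
        for p in ps:
            children[p].append(node)

    # Columns without parents can be sampled immediately.
    ready = deque(node for node, n in remaining.items() if n == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(parents):
        raise ValueError("the graph contains a cycle")
    return order


order = topological_order(parents)
print(order)
```

Any order in which G comes before age, job, and X1, and D comes last, is valid for this DAG.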

In [5]:
from mirror.nodes import *
from mirror.edges import *
from mirror.generator import Mirror
from mirror import erasers
import pandas as pd

### 1. Define the DAG and generate data

In [2]:
# name of folder to save the synthetic data
data_flag = "hiring"

# size of the data
total_n = 30000 

# initialize nodes
node_g = CategoricalNode("G", {"M": 0.5, "F": 0.5}, sample_n=total_n)
node_a = OrdinalGlobalNode("age", min=20, max=70)
node_r = CategoricalNode("job", {"E", "F"})

node_x1 = GaussianNode("X1")
node_x2 = GaussianNode("X2", miu=0, var=1, sample_n=total_n)
node_d = CategoricalNode("D", {"Y", "N"}) # only the category names (the domain) matter here; no probabilities are attached to them

# initialize edges
edge_g_a = CtoN("G", "age", {"M": ["Gaussian", 30, 10], "F": ["Gaussian", 45, 10]})
edge_g_r = CtoC("G", "job", {"M": {"E": 0.6, "F": 0.4}, "F": {"E": 0.4, "F": 0.6}})
edge_g_x1 = CtoN("G", "X1", {"M": ["Gaussian", 1, 0.5], "F": ["Gaussian", 0, 1]})


edge_a_d = NtoC("age", "D", [50], [{"Y": 0.8, "N": 0.2}, {"Y": 0.2, "N": 0.8}])
edge_r_d = CtoC("job", "D", {"E": {"Y": 0.6, "N": 0.4}, "F": {"Y": 0.4, "N": 0.6}})

edge_x1_d = NtoC("X1", "D", [0.5], [{"Y": 0.4, "N": 0.6}, {"Y": 0.6, "N": 0.4}])
edge_x2_d = NtoC("X2", "D", [0.5], [{"Y": 0.8, "N": 0.2}, {"Y": 0.2, "N": 0.8}])

# define DAG
nodes = [node_g, node_a, node_r, node_x1, node_x2, node_d]
edge_relation = {"X1": edge_g_x1,
                 "age": edge_g_a,
                 "job": edge_g_r,
                 "D": ([edge_x1_d, edge_r_d, edge_a_d, edge_x2_d],[0.25, 0.25, 0.25, 0.25])}



# generate data
mirror = Mirror(seed=0)
mirror.generate_csv(nodes, edge_relation)
mirror.save_to_disc("out/"+data_flag+"/R1.csv", excluded_cols=['C_X1', 'C_age', 'C_X2', 'group'])

print()

G independet ['G']
----------------------------------------

age with parents
One parent <mirror.edges.CtoN object at 0x121830c70> ['G', 'age']
----------------------------------------

job with parents
One parent <mirror.edges.CtoC object at 0x121830c40> ['G', 'age', 'job']
----------------------------------------

X1 with parents
One parent <mirror.edges.CtoN object at 0x121830cd0> ['G', 'age', 'job', 'X1']
----------------------------------------

X2 independet ['G', 'age', 'job', 'X1', 'X2']
----------------------------------------

D with parents
New CPT {'1E00': {'N': 0.3, 'Y': 0.7}, '1F00': {'N': 0.35, 'Y': 0.65}, '1E01': {'N': 0.45, 'Y': 0.55}, '0F01': {'N': 0.55, 'Y': 0.45}, '1F01': {'N': 0.5, 'Y': 0.5}, '0F00': {'N': 0.39999999999999997, 'Y': 0.6000000000000001}, '0E00': {'N': 0.35, 'Y': 0.65}, '0E01': {'N': 0.5, 'Y': 0.5}, '1F10': {'N': 0.5, 'Y': 0.5}, '0F10': {'N': 0.55, 'Y': 0.45}, '0E11': {'N': 0.65, 'Y': 0.35}, '0E10': {'N': 0.5, 'Y': 0.5}, '0F11': {'N': 0.7, 'Y': 0.3}, 
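
The printed entries of the combined table for D are consistent with a weighted average of the four per-edge distributions. The sketch below reproduces that arithmetic without mirrorGen. It rests on assumptions inferred from the output rather than from the library's code: each key lists the X1 bin, the job category, the age bin, and the X2 bin in that order, and the bin indices "0" and "1" select the first and second distribution passed to `NtoC`. The name `combine_cpts` is hypothetical.

```python
from itertools import product

# Per-edge distributions of D, copied from the edge definitions above.
# Assumption: a numeric parent with one threshold has two bins, "0" and "1".
edge_cpts = {
    "X1": {"0": {"Y": 0.4, "N": 0.6}, "1": {"Y": 0.6, "N": 0.4}},
    "job": {"E": {"Y": 0.6, "N": 0.4}, "F": {"Y": 0.4, "N": 0.6}},
    "age": {"0": {"Y": 0.8, "N": 0.2}, "1": {"Y": 0.2, "N": 0.8}},
    "X2": {"0": {"Y": 0.8, "N": 0.2}, "1": {"Y": 0.2, "N": 0.8}},
}
weights = {"X1": 0.25, "job": 0.25, "age": 0.25, "X2": 0.25}


def combine_cpts(edge_cpts, weights):
    """Weighted average of the per-edge distributions for every parent combination."""
    parent_names = list(edge_cpts)
    combined = {}
    # Iterating over each inner dict yields its keys (bins or categories).
    for combo in product(*(edge_cpts[p] for p in parent_names)):
        key = "".join(combo)
        combined[key] = {
            outcome: sum(
                weights[p] * edge_cpts[p][value][outcome]
                for p, value in zip(parent_names, combo)
            )
            for outcome in ("Y", "N")
        }
    return combined


cpt = combine_cpts(edge_cpts, weights)
print(round(cpt["1E00"]["Y"], 2), round(cpt["0E11"]["Y"], 2))
```

For the key `'1E00'` this gives (0.6 + 0.6 + 0.8 + 0.8) / 4 = 0.7 for 'Y', matching the first entry printed above.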

### 2. Simulate missing values in the data generated above

### 2.1 Missing completely at random (MCAR)

In [7]:
# name of the missingness pattern (used in the output file name)
missing = 'mcar'
# columns in which to inject missing values
applied_cols = ['job', 'X1']
# fraction of values to replace with missing values
fraction = 0.2
# encoding of the missing values
missing_values = {x:'?' for x in applied_cols}
# random seed to use
seed = 0

# initialize the eraser
perturbation = erasers.MCAR_eraser(applied_cols, fraction, missing_values, seed)
# read the data and inject the missing values
data = pd.read_csv('out/hiring/R1.csv')
missing_data = perturbation.transform(data)
missing_data.to_csv('out/hiring/R1_'+missing+'.csv')
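
To make the idea behind MCAR concrete, here is a minimal pandas sketch that does not use mirrorGen. It is not the implementation of `MCAR_eraser`. The helper `inject_mcar` is a hypothetical name, and it assumes the fraction is applied to each column independently, with rows chosen uniformly at random.

```python
import numpy as np
import pandas as pd


def inject_mcar(df, cols, fraction, missing_values, seed=0):
    """Hypothetical helper: replace a random fraction of each column with its marker."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_missing = int(round(fraction * len(out)))
    for col in cols:
        # Cast to object so a string marker such as '?' fits into a numeric column.
        out[col] = out[col].astype(object)
        # Every row is equally likely to be chosen, regardless of any value.
        rows = rng.choice(len(out), size=n_missing, replace=False)
        out.iloc[rows, out.columns.get_loc(col)] = missing_values[col]
    return out


# Toy data standing in for the generated CSV.
toy = pd.DataFrame({"job": ["E", "F"] * 50, "X1": np.linspace(-2, 2, 100)})
toy_mcar = inject_mcar(toy, ["job", "X1"], 0.2, {"job": "?", "X1": "?"})
print((toy_mcar == "?").mean())
```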

### 2.2 Missing at random (MAR): missingness depends on other columns

In [10]:
missing = 'mar'
applied_cols = ['job', 'X1']
fraction = 0.2 
missing_values = {x:'?' for x in applied_cols}
seed = 0
# specify the columns on which the missingness depends, one per applied column, e.g., both job and X1 depend on G
depends_on_cols = ['G', 'G']
# specify the order of each dependent column
# for a categorical column, the order is given by a value for each category, e.g., {'G': {'M': 1, 'F': 0}} means sorting by G using these values for 'M' and 'F'.
# for a numerical column, the order is given by a weight for the column, e.g., {'X1': -1} means sorting by X1 in reverse order.
depends_on_cols_orders = {'G': {'M': 1, 'F': 0}}

perturbation = erasers.MAR_eraser(applied_cols, fraction, missing_values, depends_on_cols, depends_on_cols_orders, seed)

data = pd.read_csv('out/hiring/R1.csv')
missing_data = perturbation.transform(data)
missing_data.to_csv('out/hiring/R1_'+missing+'.csv')
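
The comments above describe ranking rows by a dependent column. How `MAR_eraser` turns that ranking into missing cells is not shown in this notebook, so the following is only a sketch of one plausible reading: the rows with the highest score lose their value first, and ties are broken at random. The helper `inject_by_rank` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def inject_by_rank(df, col, score, fraction, marker="?", seed=0):
    """Hypothetical helper: erase `col` in the rows with the highest `score`."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_missing = int(round(fraction * len(out)))
    # Shuffle first, then sort stably, so rows with equal scores end up in random order.
    shuffled = rng.permutation(len(out))
    scores = np.asarray(score, dtype=float)[shuffled]
    ranked = shuffled[np.argsort(-scores, kind="stable")]
    rows = ranked[:n_missing]
    out[col] = out[col].astype(object)
    out.iloc[rows, out.columns.get_loc(col)] = marker
    return out


toy = pd.DataFrame({"G": ["M", "F"] * 50, "job": ["E", "F", "F", "E"] * 25})
# Same ordering as depends_on_cols_orders above: 'M' ranks higher than 'F'.
g_score = toy["G"].map({"M": 1, "F": 0})
toy_mar = inject_by_rank(toy, "job", g_score, 0.2)
print((toy_mar["job"] == "?").groupby(toy_mar["G"]).mean())
```

Under this reading, whether `job` is missing depends only on G and not on the value of `job` itself, which is what distinguishes MAR from the other two patterns.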

### 2.3 Missing not at random (NMAR): missingness depends on the columns themselves

In [11]:
missing = 'nmar'
applied_cols = ['job', 'X1']
fraction = 0.2 
missing_values = {x:'?' for x in applied_cols}
seed = 0
# specify the order of the columns to be injected with missing values
# for a categorical column, the order is given by a value for each category, e.g., {'job': {'F': 1, 'E': 0}} means sorting by job using these values for 'F' and 'E'.
# for a numerical column, the order is given by a weight for the column, e.g., {'X1': -1} means sorting by X1 in reverse order.
missings_cols_orders = {'job': {'F': 1, 'E': 0}, 'X1': -1} # order first by F and E and then by the reversed order of X1
perturbation = erasers.NMAR_eraser(applied_cols, fraction, missing_values, missings_cols_orders, seed)

data = pd.read_csv('out/hiring/R1.csv')
missing_data = perturbation.transform(data)
missing_data.to_csv('out/hiring/R1_'+missing+'.csv')
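
A quick way to tell the three patterns apart is to compare missing rates across groups. The helper below is an illustrative sketch rather than part of mirrorGen (`missing_rate_by_group` is a hypothetical name). It assumes missing cells are encoded with the `'?'` marker used above.

```python
import pandas as pd


def missing_rate_by_group(df, col, group_col, marker="?"):
    """Share of rows in each group whose `col` equals the missing-value marker."""
    # Cast to str so the comparison also works for columns that mix numbers and markers.
    is_missing = df[col].astype(str) == marker
    return is_missing.groupby(df[group_col]).mean()


# Toy frame standing in for one of the CSV files written above.
toy = pd.DataFrame({"G": ["M", "M", "F", "F"], "job": ["?", "E", "F", "E"]})
rates = missing_rate_by_group(toy, "job", "G")
print(rates)
```

Applied to the files written above (for example `pd.read_csv('out/hiring/R1_mar.csv')`), roughly equal rates across the values of G would be expected for MCAR, while MAR with the ordering on G should show clearly different rates.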