## Data Simulation
* Here we simulate the production line data at three sample sizes (500, 1000, and 2000 rows) to investigate how the score-based causal discovery algorithms perform as the amount of data grows.
* Bootstrapping is run for 10 iterations on the 500-sample dataset only, due to resource limits.
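
To make the bootstrapping setup concrete, here is a minimal sketch of drawing 10 resamples with replacement from a 500-row dataset. It uses only the standard library and works on row indices. The helper name `bootstrap_indices` and the seed are illustrative assumptions, not part of the notebook's actual pipeline.

```python
import random


def bootstrap_indices(n_rows, n_iterations=10, seed=42):
    """Return one list of row indices per bootstrap iteration.

    Each list holds n_rows indices drawn with replacement from
    range(n_rows), so some rows repeat and others are left out.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    return [rng.choices(range(n_rows), k=n_rows) for _ in range(n_iterations)]


# 10 iterations on the 500-sample dataset, as described above.
resamples = bootstrap_indices(n_rows=500, n_iterations=10)
print(len(resamples), len(resamples[0]))  # 10 500
```

Each index list could then be passed to `DataFrame.iloc` to build the resampled dataset that is fed to the causal discovery algorithm for that iteration.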

In [1]:
import pandas as pd
from causalAssembly.models_dag import ProductionLineGraph

# Define sample sizes
sample_sizes = [500, 1000, 2000]

# Step 1: Generate the simulated dataset
print("Generating simulated dataset...")
assembly_line_data = ProductionLineGraph.get_data()

# Step 2: Retrieve the ground truth DAG
print("Retrieving ground truth DAG...")
assembly_line_ground_truth = ProductionLineGraph.get_ground_truth()

# Step 3: Filter dataset columns for Stations 1, 2, and 3
station_prefixes = ["Station1_", "Station2_", "Station3_"]
filtered_columns = [
    col for col in assembly_line_data.columns if any(col.startswith(prefix) for prefix in station_prefixes)
]

# Step 4: Generate datasets for three sample sizes
for size in sample_sizes:
    # Create a filtered dataset for the current sample size
    sampled_data = assembly_line_data[filtered_columns].sample(n=size, random_state=42)
    
    # Save the filtered dataset to a CSV file
    file_name = f"station_data_{size}.csv"
    sampled_data.to_csv(file_name, index=False)
    print(f"Filtered dataset with {size} samples saved as '{file_name}'.")

# Step 5: Filter the ground truth graph for Stations 1, 2, and 3
station_nodes = [
    node for node in assembly_line_ground_truth.ground_truth.index
    if any(node.startswith(prefix) for prefix in station_prefixes)
]

# Create a filtered adjacency matrix
filtered_adjacency_matrix = assembly_line_ground_truth.ground_truth.loc[station_nodes, station_nodes]

# Save the filtered adjacency matrix to a CSV file
filtered_adjacency_matrix.to_csv("station_ground_truth.csv", index=True)
print("Ground truth adjacency matrix for Stations 1-3 saved as 'station_ground_truth.csv'.")

Generating simulated dataset...
Retrieving ground truth DAG...
Filtered dataset with 500 samples saved as 'station_data_500.csv'.
Filtered dataset with 1000 samples saved as 'station_data_1000.csv'.
Filtered dataset with 2000 samples saved as 'station_data_2000.csv'.
Ground truth adjacency matrix for Stations 1-3 saved as 'station_ground_truth.csv'.
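
As an optional sanity check, one could confirm that the filtered ground truth is still a valid DAG before using it for evaluation. The sketch below uses the standard-library `graphlib` module on a toy matrix. It assumes that a nonzero entry in row `i`, column `j` means an edge from node `i` to node `j`; the node names are hypothetical and not taken from the real dataset.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(nodes, matrix):
    """Return True if the adjacency matrix describes a DAG.

    Assumes matrix[i][j] != 0 means an edge nodes[i] -> nodes[j].
    (Hypothetical helper, for illustration only.)
    """
    # graphlib expects a mapping from each node to its predecessors.
    parents = {node: set() for node in nodes}
    for i, source in enumerate(nodes):
        for j, target in enumerate(nodes):
            if matrix[i][j]:
                parents[target].add(source)
    try:
        # static_order() raises CycleError if no topological order exists.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False


# Toy matrices with hypothetical node names.
toy_nodes = ["Station1_a", "Station2_b", "Station3_c"]
toy_dag = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
toy_cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(is_acyclic(toy_nodes, toy_dag))    # True
print(is_acyclic(toy_nodes, toy_cycle))  # False
```

With the saved file, the same check could be applied by reading `station_ground_truth.csv` with `pd.read_csv(..., index_col=0)` and passing `list(df.index)` and `df.values.tolist()` to the helper.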
