# Data Generation for Clustering Analysis

This notebook demonstrates how to generate synthetic datasets using the MixSim methodology for clustering algorithm evaluation.

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path
from IPython.display import display  # Explicit import so display() also works outside a notebook

# Add src to path
sys.path.append('../src')

from clustering_analysis import SyntheticDataGenerator, ClusteringVisualizer

## Initialize Data Generator

Instantiate the synthetic data generator, which produces datasets with a controlled degree of overlap between clusters.

In [None]:
# Initialize data generator
generator = SyntheticDataGenerator(output_dir="../data/synthetic")

print("Data generator initialized!")

## Generate Sample Datasets

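Generate datasets at several overlap levels. In MixSim, BarOmega is the average pairwise overlap between clusters, where the overlap of two clusters is the sum of their two misclassification probabilities. Larger values mean the clusters are less separated, which is the property later used to stress clustering algorithms.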
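
For intuition about the scale of these values, the sketch below computes the pairwise overlap in the simplest case: two equal-weight one-dimensional Gaussian clusters with a shared standard deviation. Under those assumptions each misclassification probability is Φ(−d / 2σ), where d is the distance between the means, and the overlap is the sum of the two. The function names are hypothetical and are not part of `clustering_analysis`; MixSim itself handles general multivariate mixtures and averages the overlap over all cluster pairs.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def pairwise_overlap_1d(mu1: float, mu2: float, sigma: float) -> float:
    """Overlap of two equal-weight 1D Gaussians with a shared standard deviation.

    Assumption for this sketch: with equal weights and equal variance the
    decision boundary is the midpoint between the means, so each cluster is
    misclassified with probability Phi(-d / (2 * sigma)).
    """
    d = abs(mu1 - mu2)
    misclassification = normal_cdf(-d / (2.0 * sigma))
    return 2.0 * misclassification  # sum of the two (equal) probabilities


# Overlap shrinks as the gap between the cluster means grows
for gap in (1.0, 2.0, 4.0, 8.0):
    overlap = pairwise_overlap_1d(0.0, gap, sigma=1.0)
    print(f"gap = {gap}: overlap = {overlap:.4f}")
```

Read this way, a BarOmega of 0.3 corresponds to cluster pairs whose means sit, on average, roughly two standard deviations apart in this simplified setting.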

In [None]:
# Parameters for dataset generation
K = 3  # Number of clusters
p = 2  # Number of dimensions (2D for visualization)
n = 1000  # Number of samples

# Different overlap levels
overlap_levels = [0.0, 0.1, 0.3, 0.5]

# Generate datasets
datasets = {}
for bar_omega in overlap_levels:
    print(f"Generating dataset with BarOmega = {bar_omega}...")
    X, y = generator.generate_dataset(bar_omega, K, p, n, random_state=42)
    datasets[bar_omega] = (X, y)
    print(f"  Shape: {X.shape}, Unique labels: {len(np.unique(y))}")

## Visualize Generated Datasets

Let's visualize how different overlap levels affect cluster separation.

In [None]:
# Plot datasets with different overlap levels
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

colors = ['red', 'blue', 'green', 'orange', 'purple']

for i, (bar_omega, (X, y)) in enumerate(datasets.items()):
    ax = axes[i]
    
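    # Plot each cluster in its own color. Colors are indexed by position,
    # so labels do not have to be zero-based integers.
    for color_idx, cluster_id in enumerate(np.unique(y)):
        mask = y == cluster_id
        ax.scatter(X[mask, 0], X[mask, 1],
                   c=colors[color_idx % len(colors)], alpha=0.7,
                   label=f'Cluster {cluster_id}', s=20)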
    
    ax.set_title(f'BarOmega = {bar_omega}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Dataset Statistics

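Summarize each dataset: its size, dimensionality, number of clusters, the mean and standard deviation of each feature, and the number of points in each cluster.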
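
Pooled feature means and standard deviations say little about how well the clusters are separated. As a complement, the sketch below computes a simple separation index: the between-cluster sum of squares divided by the within-cluster sum of squares, using the true labels. `separation_ratio` is a hypothetical helper written for illustration, not part of `clustering_analysis`, and the toy data stand in for the generated `(X, y)` pairs.

```python
import numpy as np


def separation_ratio(X: np.ndarray, y: np.ndarray) -> float:
    """Between-cluster sum of squares divided by within-cluster sum of squares."""
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for label in np.unique(y):
        members = X[y == label]
        centroid = members.mean(axis=0)
        # Spread of the centroids around the overall mean, weighted by cluster size
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Spread of the points around their own centroid
        within += np.sum((members - centroid) ** 2)
    return float(between / within)


# Toy data (assumption): two clusters with the same centers, tight versus diffuse
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.repeat([0, 1], 200)
X_tight = centers[labels] + rng.normal(scale=0.5, size=(400, 2))
X_loose = centers[labels] + rng.normal(scale=3.0, size=(400, 2))

print(f"tight clusters: {separation_ratio(X_tight, labels):.2f}")
print(f"loose clusters: {separation_ratio(X_loose, labels):.2f}")
```

Applied to the datasets above, this ratio would be expected to fall as BarOmega rises, which gives a quick numeric check on the scatter plots.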

In [None]:
# Calculate statistics for each dataset
stats_data = []

for bar_omega, (X, y) in datasets.items():
    # Calculate basic statistics
    stats = {
        'BarOmega': bar_omega,
        'n_samples': len(X),
        'n_features': X.shape[1],
        'n_clusters': len(np.unique(y)),
        'feature1_mean': X[:, 0].mean(),
        'feature1_std': X[:, 0].std(),
        'feature2_mean': X[:, 1].mean(),
        'feature2_std': X[:, 1].std(),
    }
    
    # Cluster sizes
    unique, counts = np.unique(y, return_counts=True)
    for cluster_id, count in zip(unique, counts):
        stats[f'cluster_{cluster_id}_size'] = count
    
    stats_data.append(stats)

# Create DataFrame and display
stats_df = pd.DataFrame(stats_data)
print("Dataset Statistics:")
display(stats_df)

## Generate Experimental Grid

Next, generate an experimental grid over parameter combinations. The grid below varies only BarOmega and holds the number of clusters (K), dimensions (P), and samples (N) fixed.
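
How `generate_experiment_grid` expands its input is not shown in this notebook. A common convention, assumed here, is to take the Cartesian product of the per-parameter lists, so that each combination yields one dataset. The sketch below illustrates that convention with a hypothetical `expand_grid` helper; it is not the generator's actual implementation.

```python
from itertools import product


def expand_grid(param_lists: dict) -> list:
    """Return one dict per combination of the listed parameter values."""
    keys = list(param_lists)
    value_lists = [param_lists[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Same parameter lists as the 'bar_omega_variation' entry in this notebook
variation = {
    'bar_omega': [0.0, 0.05, 0.1, 0.2, 0.3],
    'K': [3],
    'P': [5],
    'N': [1000],
}

combinations = expand_grid(variation)
print(f"{len(combinations)} parameter combinations")
for combo in combinations:
    print(combo)
```

Under this assumption, adding a second value to `K` would double the number of datasets, so grid size grows multiplicatively with each varied parameter.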

In [None]:
# Define experimental grid
experiment_grid = {
    'bar_omega_variation': {
        'bar_omega': [0.0, 0.05, 0.1, 0.2, 0.3],
        'K': [3],
        'P': [5],
        'N': [1000]
    }
}

# Generate all datasets in the grid
print("Generating experimental grid...")
grid_results = generator.generate_experiment_grid(experiment_grid)

print(f"Generated {len(grid_results)} datasets")
print("\nFirst 5 datasets:")
for i, result in enumerate(grid_results[:5]):
    print(f"{i+1}. BarOmega={result['bar_omega']}, K={result['K']}, P={result['P']}, N={result['N']}")

## List All Available Datasets

Let's see what datasets are available in our data directory.

In [None]:
# List all available datasets
available_datasets = generator.list_datasets()

print(f"Found {len(available_datasets)} datasets:")

# Convert to DataFrame for better display
datasets_df = pd.DataFrame(available_datasets)
if not datasets_df.empty:
    # Hide file paths in the display; errors='ignore' avoids a KeyError if the column is absent
    datasets_df = datasets_df.drop(columns='filepath', errors='ignore')
    display(datasets_df.head(10))
else:
    print("No datasets found.")

## Conclusion

In this notebook, we:

1. **Initialized** a synthetic data generator
2. **Generated** datasets with different overlap levels (BarOmega)
3. **Visualized** how overlap affects cluster separation
4. **Calculated** basic statistics for the datasets
5. **Created** an experimental grid for systematic analysis

The generated datasets will be used in the next notebook to test different clustering algorithms and evaluate their performance under various conditions.

### Key Observations:
- **BarOmega = 0.0**: Well-separated clusters, easy to distinguish
- **BarOmega = 0.1**: Slight overlap, still manageable
- **BarOmega = 0.3**: Moderate overlap, more challenging
- **BarOmega = 0.5**: Significant overlap, difficult clustering scenario

These datasets provide a controlled environment to test how clustering algorithms perform under different conditions of cluster separability.