# Co-Location Pattern Mining Demo

This notebook demonstrates two ways to supply input data for co-location pattern mining:
1. **Synthetic Data Generation** - Generate test data with known patterns
2. **CSV File Loading** - Load real data from CSV files

Run one of the two methods below, then continue with the mining cells. The CSV method is the one currently active.

## Setup: Import Required Libraries

In [14]:
import sys
from pathlib import Path

# For Jupyter Notebook, use the current working directory instead of __file__
sys.path.insert(0, str(Path.cwd().parent))

from colocation.synthetic import GeneratorParams, SyntheticSpatialGenerator
from data.data import SpatialDataset
from colocation.miner import CoLocationMiner

## Method 1: Using Synthetic Data (Generated) - COMMENTED OUT

Generate synthetic spatial data with controlled parameters for testing algorithms.

**Note:** This section is commented out. Uncomment to use synthetic data instead of CSV.

In [17]:
# # Configure synthetic data generation parameters
# params = GeneratorParams(
#     P=10,        # Number of prevalent patterns
#     I=200,       # Instances per feature
#     D=5000.0,    # Space dimension
#     F=10,        # Number of features
#     Q=3,         # Pattern size
#     m=10000,     # Total instances
#     min_dist=50.0,  # Minimum distance threshold
#     clumpy=1,    # Clumpiness level (1=sparse, 2-3=denser)
# )
# 
# # Generate synthetic dataset
# gen = SyntheticSpatialGenerator(params, seed=42)
# ds = gen.generate()
# 
# print(f"✓ Generated {len(ds.instances)} instances")
# print(f"✓ Feature distribution: {ds.feature_counts()}")
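To make "test data with known patterns" concrete, the following is a minimal sketch of planting one co-location pattern by hand. It is not how `SyntheticSpatialGenerator` works internally (the notebook does not show that); `plant_pattern` and its arguments are hypothetical names used only for illustration. The idea is to drop one instance of each feature into a small square whose diagonal equals the distance threshold, so every pair in a group is guaranteed to be within `min_dist`.

```python
import math
import random


def plant_pattern(features, n_groups, space, min_dist, seed=0):
    """Return (feature, group, x, y) tuples, one instance per feature per group.

    Hypothetical helper: each group is placed inside a square whose diagonal
    equals min_dist, so any two points of a group are at most min_dist apart.
    """
    rng = random.Random(seed)
    half_side = min_dist / (2 * math.sqrt(2))
    points = []
    for group in range(1, n_groups + 1):
        # Keep the whole square inside the [0, space] x [0, space] area.
        cx = rng.uniform(half_side, space - half_side)
        cy = rng.uniform(half_side, space - half_side)
        for feature in features:
            x = cx + rng.uniform(-half_side, half_side)
            y = cy + rng.uniform(-half_side, half_side)
            points.append((feature, group, x, y))
    return points


points = plant_pattern(["A", "B", "C"], n_groups=5, space=5000.0, min_dist=50.0, seed=42)
print(len(points))  # 15 (3 features x 5 groups)
```

Because the planted groups are known in advance, a miner run with the same `min_dist` should report the planted feature set among its prevalent patterns, which is what makes this kind of data useful for validation.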

---

## Method 2: Using CSV File (Real Data) - ACTIVE

Load spatial data from a CSV file.

**CSV Format Required:**
```
InstanceID,Feature,X,Y
1,A,10,10
1,B,12,11
2,A,20,20
...
```
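As a rough illustration of the format above, the sketch below parses the same header and rows with the standard library only. It is not the implementation of `SpatialDataset.from_csv`; `read_instances` is a hypothetical helper, and the tuple layout it returns is an assumption chosen for readability. Note that, in the sample rows, `InstanceID` is numbered per feature (both `A` and `B` have an instance `1`).

```python
import csv
import io
from collections import Counter

# The three data rows shown in the format example above.
SAMPLE = """InstanceID,Feature,X,Y
1,A,10,10
1,B,12,11
2,A,20,20
"""


def read_instances(handle):
    """Parse rows into (feature, instance_id, x, y) tuples after checking the header."""
    reader = csv.DictReader(handle)
    required = {"InstanceID", "Feature", "X", "Y"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        (row["Feature"], int(row["InstanceID"]), float(row["X"]), float(row["Y"]))
        for row in reader
    ]


instances = read_instances(io.StringIO(SAMPLE))
feature_counts = Counter(feature for feature, _, _, _ in instances)
print(instances[0])          # ('A', 1, 10.0, 10.0)
print(dict(feature_counts))  # {'A': 2, 'B': 1}
```

Checking the header up front gives a clear error for a misnamed column instead of a `KeyError` partway through the file.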

In [18]:
# Load data from CSV file
csv_path = Path.cwd().parent / "data" / "sample_data.csv"
ds = SpatialDataset.from_csv(str(csv_path))

print(f"✓ Loaded {len(ds.instances)} instances from CSV")
print(f"✓ Feature distribution: {ds.feature_counts()}")

✓ Loaded 30 instances from CSV
✓ Feature distribution: {'A': 6, 'B': 6, 'C': 6, 'D': 6, 'E': 6}


---

## Run Co-Location Mining Algorithms

Apply both IDS (Instance-Data-Structure) and NDS (Neighbor-Data-Structure) approaches.

In [19]:
# Configure mining parameters
min_dist = 5.0   # Neighbor distance threshold (adjusted for CSV data scale)
min_prev = 0.2   # Minimum prevalence threshold (0-1)

# Initialize miner
miner = CoLocationMiner(
    dataset=ds,
    min_dist=min_dist,
    min_prev=min_prev
)

print("Running co-location mining algorithms...")
print(f"  - Distance threshold: {min_dist}")
print(f"  - Prevalence threshold: {min_prev}")

# Run both algorithms
cliques_ids, prev_ids = miner.run_ids()  # IDS approach
cliques_nds, prev_nds = miner.run_nds()  # NDS approach

print("✓ Mining completed!")

Running co-location mining algorithms...
  - Distance threshold: 5.0
  - Prevalence threshold: 0.2
✓ Mining completed!
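To show what the distance threshold controls, here is a minimal sketch of the neighbor test and a clique check. It assumes that two instances are neighbors when their Euclidean distance is at most `min_dist`, and that a clique holds at most one instance per feature; neither detail is confirmed by the notebook, and `are_neighbors` and `is_clique` are hypothetical helpers, not part of `CoLocationMiner`. The coordinates are taken from instances printed in this notebook's clique output.

```python
import math
from itertools import combinations

# (feature, index) -> (x, y), copied from the notebook's printed instances.
instances = {
    ("A", 1): (10.0, 10.0),
    ("B", 1): (12.0, 11.0),
    ("C", 1): (11.0, 13.0),
    ("D", 1): (13.0, 12.0),
    ("A", 2): (20.0, 20.0),
}


def are_neighbors(p, q, min_dist):
    """Assumed neighbor relation: Euclidean distance no larger than min_dist."""
    return math.dist(p, q) <= min_dist


def is_clique(keys, min_dist):
    """True if the instances have distinct features and are pairwise neighbors."""
    features = [feature for feature, _ in keys]
    if len(set(features)) != len(features):
        return False
    return all(
        are_neighbors(instances[a], instances[b], min_dist)
        for a, b in combinations(keys, 2)
    )


print(is_clique([("A", 1), ("B", 1), ("C", 1), ("D", 1)], 5.0))  # True
print(is_clique([("A", 2), ("B", 1)], 5.0))                      # False
```

Under these assumptions, the largest pairwise distance among the four first-group instances is about 3.61 (between `A1` and `D1`), so lowering `min_dist` below that value would break the four-instance clique apart.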


---

## Results: IDS Approach (Instance-Data-Structure)

In [20]:
print(f"Number of cliques (IDS): {len(cliques_ids)}")
print(f"\nSample cliques (first 5):")
for i, clique in enumerate(list(cliques_ids)[:5], 1):
    print(f"  {i}. {clique}")

Number of cliques (IDS): 25

Sample cliques (first 5):
  1. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  2. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='B', index=1, x=12.0, y=11.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  3. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='C', index=1, x=11.0, y=13.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  4. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='B', index=1, x=12.0, y=11.0), Instance(feature='C', index=1, x=11.0, y=13.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  5. (Instance(feature='A', index=2, x=20.0, y=20.0), Instance(feature='C', index=2, x=19.0, y=21.0))


## Results: NDS Approach (Neighbor-Data-Structure)

In [21]:
print(f"Number of cliques (NDS): {len(cliques_nds)}")
print(f"\nSample cliques (first 5):")
for i, clique in enumerate(list(cliques_nds)[:5], 1):
    print(f"  {i}. {clique}")

Number of cliques (NDS): 17

Sample cliques (first 5):
  1. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='B', index=1, x=12.0, y=11.0), Instance(feature='C', index=1, x=11.0, y=13.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  2. (Instance(feature='A', index=1, x=10.0, y=10.0), Instance(feature='B', index=1, x=12.0, y=11.0), Instance(feature='D', index=1, x=13.0, y=12.0))
  3. (Instance(feature='A', index=2, x=20.0, y=20.0), Instance(feature='B', index=2, x=21.0, y=22.0), Instance(feature='C', index=2, x=19.0, y=21.0))
  4. (Instance(feature='A', index=3, x=40.0, y=40.0), Instance(feature='B', index=3, x=42.0, y=41.0), Instance(feature='C', index=3, x=41.0, y=39.0))
  5. (Instance(feature='A', index=4, x=60.0, y=60.0), Instance(feature='B', index=4, x=61.0, y=62.0), Instance(feature='D', index=2, x=62.0, y=60.0))


## Prevalent Patterns

Patterns that meet the minimum prevalence threshold.

In [22]:
print(f"Number of prevalent patterns (IDS): {len(prev_ids)}")
print(f"\nSample prevalent patterns (first 5):")
for i, pattern in enumerate(list(prev_ids)[:5], 1):
    print(f"  {i}. {pattern}")

Number of prevalent patterns (IDS): 17

Sample prevalent patterns (first 5):
  1. frozenset({'C'})
  2. frozenset({'B'})
  3. frozenset({'C', 'B'})
  4. frozenset({'A'})
  5. frozenset({'C', 'A'})

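A prevalence threshold such as `min_prev` is commonly evaluated with the participation index: for each feature in a pattern, take the fraction of that feature's instances that appear in at least one row instance of the pattern, then take the minimum over the features. The notebook does not state which measure `CoLocationMiner` uses, so the sketch below is an assumption about the general technique, and `participation_index` is a hypothetical helper. The input uses only the A-B pairs visible in the truncated sample output together with the reported feature counts, so the resulting value is illustrative rather than the notebook's actual result.

```python
from collections import defaultdict


def participation_index(row_instances, feature_counts):
    """Return (index, ratios) for one pattern.

    row_instances: iterable of tuples of (feature, instance_index) pairs.
    feature_counts: total number of instances per feature in the dataset.
    """
    seen = defaultdict(set)
    for row in row_instances:
        for feature, index in row:
            seen[feature].add(index)
    # Participation ratio: distinct participating instances / all instances.
    ratios = {f: len(ids) / feature_counts[f] for f, ids in seen.items()}
    if not ratios:
        return 0.0, {}
    return min(ratios.values()), ratios


# Illustrative input: A-B pairs that appear in the printed sample cliques.
rows_ab = [
    (("A", 1), ("B", 1)),
    (("A", 2), ("B", 2)),
    (("A", 3), ("B", 3)),
    (("A", 4), ("B", 4)),
]
pi, ratios = participation_index(rows_ab, {"A": 6, "B": 6})
print(round(pi, 3))  # 0.667
print(pi >= 0.2)     # True, so {A, B} would pass min_prev = 0.2
```

Taking the minimum means a pattern is only as prevalent as its least-involved feature, which is why raising `min_prev` prunes patterns that depend on a few isolated instances.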

In [23]:
print(f"Number of prevalent patterns (NDS): {len(prev_nds)}")
print(f"\nSample prevalent patterns (first 5):")
for i, pattern in enumerate(list(prev_nds)[:5], 1):
    print(f"  {i}. {pattern}")

Number of prevalent patterns (NDS): 17

Sample prevalent patterns (first 5):
  1. frozenset({'C'})
  2. frozenset({'B'})
  3. frozenset({'C', 'B'})
  4. frozenset({'A'})
  5. frozenset({'C', 'A'})
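
Both runs report 17 prevalent patterns, but equal counts alone do not show that the two result sets match. The sketch below compares two pattern collections as sets. It assumes `prev_ids` and `prev_nds` are iterables of `frozenset` objects, as the printed output suggests; `compare_patterns` is a hypothetical helper, and the stand-in lists reuse the five patterns shown in the sample output.

```python
def compare_patterns(left, right):
    """Split two pattern collections into shared and one-sided patterns."""
    left_set, right_set = set(left), set(right)
    return {
        "common": left_set & right_set,
        "only_left": left_set - right_set,
        "only_right": right_set - left_set,
    }


# Stand-ins for prev_ids and prev_nds, copied from the sample output.
sample_ids = [
    frozenset({"C"}),
    frozenset({"B"}),
    frozenset({"C", "B"}),
    frozenset({"A"}),
    frozenset({"C", "A"}),
]
sample_nds = list(reversed(sample_ids))  # same patterns, different order

diff = compare_patterns(sample_ids, sample_nds)
print(len(diff["common"]))                    # 5
print(diff["only_left"], diff["only_right"])  # set() set()
```

In the notebook, calling `compare_patterns(prev_ids, prev_nds)` and checking that both one-sided sets are empty would confirm that the two approaches agree regardless of iteration order.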


---

## Summary

**Key Differences:**
- **Synthetic Data**: Generated with controlled parameters, useful for testing and validation
- **CSV Data**: Real-world data, requires proper formatting (InstanceID, Feature, X, Y columns)

**How to Switch:**
1. For CSV data: Use Method 2 (current default)
2. For synthetic data: Uncomment the Method 1 cell and comment out the Method 2 cell
3. Adjust `min_dist` to match your data scale (the synthetic parameters above use 50.0, while the CSV sample uses 5.0)