# StrepSuis-AMRVirKM: K-Modes Clustering Analysis

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MK-vet/strepsuis-amrvirkm/blob/main/notebooks/cluster_analysis.ipynb)

## Overview

This notebook provides a user-friendly interface for running K-Modes clustering analysis on antimicrobial resistance (AMR) and virulence data. **No coding knowledge required!**

### What This Analysis Does

- **K-Modes Clustering**: Groups bacterial strains based on similar resistance and virulence profiles
- **Feature Importance**: Identifies which genes/traits are most important for distinguishing clusters
- **Association Rules**: Discovers co-occurrence patterns (e.g., "if gene A is present, gene B is also likely present")
- **Statistical Testing**: Runs statistical tests and reports confidence intervals (the number of bootstrap resamples is configurable in Step 3)
- **Professional Reports**: Generates publication-ready HTML and Excel reports
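
For readers curious about what K-Modes does with binary profiles, here is a minimal pure-Python sketch of the general idea: assign each strain to the nearest mode by counting mismatches, then recompute each mode column by column. It is an illustration only, not the package's implementation. The toy data and the choice of initial modes are hypothetical, and the package may use a different initialization and stopping rule.

```python
from collections import Counter

def hamming(a, b):
    """Count positions where two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Return the most frequent value in each column (the cluster mode)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(profiles, init_modes, max_iter=20):
    """Alternate assignment and mode updates until assignments stop changing."""
    modes = list(init_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda k: hamming(p, modes[k]))
            for p in profiles
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(modes)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # keep the old mode if a cluster ends up empty
                modes[k] = column_modes(members)
    return labels, modes

# Hypothetical toy data: 6 strains x 4 binary features
profiles = [
    (1, 1, 0, 0),
    (1, 1, 0, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (0, 0, 1, 1),
    (0, 1, 1, 1),
]
labels, modes = k_modes(profiles, init_modes=[profiles[0], profiles[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(modes)   # [(1, 1, 0, 0), (0, 0, 1, 1)]
```

Modes take the place of the means used in K-Means, so every cluster centre is itself a valid presence/absence profile.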

### How to Use This Notebook

1. **Run the installation cell** (installs the package from GitHub)
2. **Upload your data files** (MIC.csv, AMR_genes.csv, Virulence.csv)
3. **Run the analysis cell**
4. **Download results** (HTML report, Excel file, charts)

That's it! The notebook will guide you through each step.

---

## Step 1: Install StrepSuis-AMRVirKM

This cell installs the package directly from GitHub, so the notebook runs the repository's code instead of a copy pasted into the cells.

**Note**: This may take 1-2 minutes.

In [None]:
# Install the package from GitHub. The -q flag keeps pip's output short;
# %%capture is not used because it would also hide the confirmation below.
!pip install -q git+https://github.com/MK-vet/strepsuis-amrvirkm.git

# Verify installation
import strepsuis_amrvirkm
print(f"✓ StrepSuis-AMRVirKM v{strepsuis_amrvirkm.__version__} installed successfully!")

## Step 2: Upload Your Data Files

Upload your CSV files using the file upload button below.

### Required Files:
1. **MIC.csv** - Minimum Inhibitory Concentration data (resistance phenotypes)
2. **AMR_genes.csv** - Antimicrobial resistance genes
3. **Virulence.csv** - Virulence factors

### File Format Requirements:
- First column must be named `Strain_ID`
- All other columns must contain binary values (0 or 1)
- 0 = absence, 1 = presence
- No missing values

**Don't have data?** Download example files from the [examples folder](https://github.com/MK-vet/strepsuis-amrvirkm/tree/main/examples) of the repository.
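
If you want to check a file against the format requirements above before uploading, the following sketch may help. It uses only the Python standard library. The function name `validate_binary_csv` and the column names in the sample data are hypothetical, and the package may apply its own, stricter validation.

```python
import csv
import io

def validate_binary_csv(text):
    """Return a list of problems found in CSV text; an empty list means it passed."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header or header[0] != "Strain_ID":
        problems.append("first column must be named Strain_ID")
        header = header or []
    for line_no, row in enumerate(reader, start=2):
        if header and len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
            continue
        # Every column after Strain_ID must hold exactly 0 or 1 (no blanks)
        for name, value in zip(header[1:], row[1:]):
            if value.strip() not in {"0", "1"}:
                problems.append(
                    f"line {line_no}, column {name}: {value!r} is not 0 or 1"
                )
    return problems

# Hypothetical sample content
good = "Strain_ID,geneA,geneB\nS1,1,0\nS2,0,1\n"
bad = "Strain_ID,geneA,geneB\nS1,1,\nS2,2,1\n"

print(validate_binary_csv(good))  # []
for problem in validate_binary_csv(bad):
    print(problem)
```

To check a file on disk, pass its contents, for example `validate_binary_csv(open('MIC.csv').read())`.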

In [None]:
from google.colab import files
import os

# Create data directory
os.makedirs('data', exist_ok=True)

print("Please upload your CSV files:")
print("  1. MIC.csv")
print("  2. AMR_genes.csv")
print("  3. Virulence.csv")
print()

uploaded = files.upload()

# Move files to data directory
for filename in uploaded.keys():
    os.rename(filename, f'data/{filename}')
    print(f"✓ {filename} uploaded successfully")

# Verify files
required_files = ['MIC.csv', 'AMR_genes.csv', 'Virulence.csv']
missing_files = [f for f in required_files if not os.path.exists(f'data/{f}')]

if missing_files:
    print(f"\n⚠ Warning: Missing files: {', '.join(missing_files)}")
else:
    print("\n✓ All required files uploaded!")

## Step 3: Configure Analysis Parameters (Optional)

You can customize the analysis parameters below, or use the default values.

**Recommended settings:**
- For quick analysis: `max_clusters=8`, `bootstrap=200`
- For publication: `max_clusters=10`, `bootstrap=500`
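
To give a feel for what the bootstrap setting controls, here is a small sketch of a percentile bootstrap confidence interval for the prevalence of a single trait. The percentile method, the function name `bootstrap_ci`, and the toy data are assumptions made for illustration; this notebook does not show how the package computes its intervals.

```python
import random

def bootstrap_ci(values, n_resamples=200, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Hypothetical presence/absence of one gene across 20 strains (prevalence 0.6)
gene = [1] * 12 + [0] * 8
for b in (200, 500):
    lo, hi = bootstrap_ci(gene, n_resamples=b)
    print(f"{b} resamples: 95% CI for prevalence = ({lo:.2f}, {hi:.2f})")
```

More resamples make the interval endpoints more stable from run to run, at the cost of proportionally longer runtime.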

In [None]:
# Analysis parameters
MAX_CLUSTERS = 8           # Maximum number of clusters to test (2-10)
BOOTSTRAP_ITERATIONS = 200 # Number of bootstrap resamples (200-1000)
MIN_SUPPORT = 0.3          # Minimum support for association rules (0.1-0.5)
MIN_CONFIDENCE = 0.5       # Minimum confidence for association rules (0.5-0.9)
RANDOM_SEED = 42           # Random seed for reproducibility

print("Analysis Configuration:")
print(f"  Max clusters: {MAX_CLUSTERS}")
print(f"  Bootstrap iterations: {BOOTSTRAP_ITERATIONS}")
print(f"  Min support: {MIN_SUPPORT}")
print(f"  Min confidence: {MIN_CONFIDENCE}")
print(f"  Random seed: {RANDOM_SEED}")
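
The `MIN_SUPPORT` and `MIN_CONFIDENCE` thresholds above filter association rules. As a sketch of the standard definitions (not the package's code), support is the fraction of strains that carry all items in a rule, and confidence is the fraction of strains carrying the antecedent that also carry the consequent. The gene names and data below are hypothetical.

```python
def support(rows, items):
    """Fraction of rows in which every item in `items` is present (== 1)."""
    hits = sum(all(row[i] == 1 for i in items) for row in rows)
    return hits / len(rows)

def confidence(rows, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    base = support(rows, antecedent)
    return support(rows, antecedent + consequent) / base if base else 0.0

# Hypothetical data: each dict is one strain
rows = [
    {"geneA": 1, "geneB": 1},
    {"geneA": 1, "geneB": 1},
    {"geneA": 1, "geneB": 0},
    {"geneA": 0, "geneB": 1},
    {"geneA": 0, "geneB": 0},
]

s = support(rows, ["geneA", "geneB"])       # 2 of 5 strains carry both
c = confidence(rows, ["geneA"], ["geneB"])  # 2 of the 3 geneA strains carry geneB

min_support, min_confidence = 0.3, 0.5      # same values as the defaults above
kept = s >= min_support and c >= min_confidence
print(f"support={s:.2f}, confidence={c:.2f}, kept={kept}")
# support=0.40, confidence=0.67, kept=True
```

Raising either threshold yields fewer, stronger rules; lowering them yields more rules, including weaker ones.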

## Step 4: Run Analysis

This cell runs the complete clustering analysis. **This may take 5-15 minutes** depending on:
- Dataset size
- Number of bootstrap iterations
- Colab resources available

You'll see progress messages as the analysis runs.

In [None]:
from strepsuis_amrvirkm import ClusterAnalyzer
import os

# Create output directory
os.makedirs('output', exist_ok=True)

print("="*60)
print("StrepSuis-AMRVirKM Cluster Analysis")
print("="*60)
print()

# Initialize analyzer
analyzer = ClusterAnalyzer(
    data_dir='data',
    output_dir='output',
    max_clusters=MAX_CLUSTERS,
    bootstrap_iterations=BOOTSTRAP_ITERATIONS,
    min_support=MIN_SUPPORT,
    min_confidence=MIN_CONFIDENCE,
    random_seed=RANDOM_SEED
)

# Run analysis
print("Running analysis...")
results = analyzer.run()

# Generate reports
print("\nGenerating reports...")
html_file = analyzer.generate_html_report(results)
excel_file = analyzer.generate_excel_report(results)

print("\n" + "="*60)
print("✓ Analysis Complete!")
print("="*60)
print("\nResults saved to:")
print(f"  - HTML report: {html_file}")
print(f"  - Excel report: {excel_file}")
print("  - Charts: output/png_charts/")
print("\nProceed to the next cell to download results.")

## Step 5: Download Results

Download all results as a ZIP file for easy transfer.

In [None]:
from google.colab import files
import shutil

# Create ZIP file
print("Creating ZIP file...")
shutil.make_archive('results', 'zip', 'output')

# Download ZIP
print("Downloading results...")
files.download('results.zip')

print("\n✓ Download complete!")
print("\nExtract the ZIP file to access:")
print("  - Interactive HTML report")
print("  - Excel workbook with all data")
print("  - High-quality PNG charts")
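
To confirm what an archive contains before downloading it, you can list its members with the standard `zipfile` module. The sketch below demonstrates this on a throwaway directory; the helper name `archive_and_list` and the file names are hypothetical and do not reflect the package's actual output names.

```python
import os
import shutil
import tempfile
import zipfile

def archive_and_list(source_dir, archive_base):
    """Zip `source_dir` and return the sorted member names of the archive."""
    zip_path = shutil.make_archive(archive_base, "zip", source_dir)
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())

# Demonstration on a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output")
    os.makedirs(os.path.join(out, "png_charts"))
    for name in ("report.html", "report.xlsx",
                 os.path.join("png_charts", "clusters.png")):
        with open(os.path.join(out, name), "w") as fh:
            fh.write("placeholder")
    members = archive_and_list(out, os.path.join(tmp, "results"))

for member in members:
    print(member)
```

In this notebook the equivalent call would be `archive_and_list('output', 'results')`, which writes `results.zip` just as the cell above does.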

## Optional: Preview Results in Colab

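Render the HTML report directly in the notebook.

In [None]:
# Render the HTML report inline. A local file path passed to IFrame is
# generally not reachable from Colab's output frame, so the file is read
# and embedded instead.
from IPython.display import HTML, display

print("Preview of HTML report:")
display(HTML(filename=html_file))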

---

## Support

- **Documentation**: [GitHub Repository](https://github.com/MK-vet/strepsuis-amrvirkm)
- **Issues**: [Report a bug](https://github.com/MK-vet/strepsuis-amrvirkm/issues)
- **Examples**: [Example datasets](https://github.com/MK-vet/strepsuis-amrvirkm/tree/main/examples)

## Citation

If you use this tool in your research, please cite:

```
StrepSuis-AMRVirKM: K-Modes Clustering of Antimicrobial Resistance and Virulence Profiles
MK-vet (2025). https://github.com/MK-vet/strepsuis-amrvirkm
```

---

**License**: MIT  
**Version**: 1.0.0  
**Last Updated**: 2025-01-14