# Cluster Analysis for MIC, AMR, and Virulence Data

This notebook clusters strains from binary MIC, AMR, and virulence profiles and reports which features are statistically associated with each cluster.

## Features:
- K-Modes clustering with automatic cluster selection
- Multiple Correspondence Analysis (MCA)
- Statistical tests (Chi-square, Fisher's exact)
- Feature importance analysis
- Association rule mining
- HTML and Excel report generation

## Requirements:
- Binary data (0 = absence, 1 = presence)
- CSV files, each with a `Strain_ID` column
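
The two requirements above can be checked before any analysis is run. The following is a minimal sketch, not part of the downloaded script: the helper name `validate_binary_table` and the toy tables are hypothetical, and treating duplicate strain IDs as a problem is an assumption.

```python
import pandas as pd

def validate_binary_table(df, id_col="Strain_ID"):
    """Return a list of problems; an empty list means the table looks usable.

    Hypothetical helper for illustration only.
    """
    if id_col not in df.columns:
        return [f"missing '{id_col}' column"]
    problems = []
    # Assumption: each strain should appear once per file.
    if df[id_col].duplicated().any():
        problems.append(f"duplicate values in '{id_col}'")
    features = df.drop(columns=[id_col])
    for col in features.columns:
        # Anything other than 0/1 (including missing values) breaks the binary requirement.
        if not features[col].isin([0, 1]).all():
            problems.append(f"column '{col}' contains values other than 0/1")
    return problems

# Hypothetical example data, not taken from the repository.
good = pd.DataFrame({"Strain_ID": ["S1", "S2"], "geneA": [0, 1], "geneB": [1, 1]})
bad = pd.DataFrame({"Strain_ID": ["S1", "S1"], "geneA": [0, 2]})

print(validate_binary_table(good))
print(validate_binary_table(bad))
```

In practice the same check could be applied to each uploaded file, for example `validate_binary_table(pd.read_csv("AMR_genes.csv"))`.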

---

## 1. Setup and Installation

Install required packages for analysis:

In [None]:
%%capture
# Install required packages
!pip install kmodes prince ydata-profiling joblib numba tqdm psutil statsmodels jinja2 plotly openpyxl kaleido xlsxwriter

In [None]:
# Import libraries
import sys
import os
import gc
import traceback
import logging
import multiprocessing
from google.colab import files

import numpy as np
import pandas as pd
import psutil
from tqdm import tqdm
from joblib import Parallel, delayed
from numba import jit

# K-Modes
from kmodes.kmodes import KModes

# Statistics
from sklearn.utils import resample
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from scipy.stats import chi2_contingency, fisher_exact

# Visualization
import plotly.express as px
from prince import MCA

# Data Profiling
from ydata_profiling import ProfileReport

# Jinja2 for HTML templating
import jinja2
import base64
import io

print("✓ All packages imported successfully")

## 2. Download Utility Files

Download the Excel report utility module from the repository:

In [None]:
# Download excel_report_utils.py from GitHub
!wget -q https://raw.githubusercontent.com/MK-vet/MKrep/main/excel_report_utils.py

# Import the utility module
from excel_report_utils import ExcelReportGenerator, sanitize_sheet_name

print("✓ Excel report utilities loaded")

## 3. Configuration

Set up logging and configuration parameters:

In [None]:
# Logging configuration
logging.basicConfig(
    filename="analysis.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Global settings
sys.setrecursionlimit(10000)
gc.enable()
gc.set_threshold(100, 5, 5)
np.seterr(all='ignore')
np.random.seed(42)

# Output folder
output_folder = "clustering_analysis_results"
os.makedirs(output_folder, exist_ok=True)

print(f"✓ Configuration complete. Output folder: {output_folder}")

## 4. Upload Data Files

Upload your CSV files containing:
- MIC.csv (Minimum Inhibitory Concentration data)
- AMR_genes.csv (Antimicrobial resistance genes)
- Virulence.csv (Virulence factors)
- MLST.csv (Multi-locus sequence typing)
- Serotype.csv (Serological types)
- Plasmid.csv (Plasmid presence/absence)
- MGE.csv (Mobile genetic elements)

**Note:** All data should be binary (0 = absence, 1 = presence) with a 'Strain_ID' column.

In [None]:
# Upload files
print("Please upload your CSV data files:")
uploaded = files.upload()

# Verify uploaded files
print("\nUploaded files:")
for filename in uploaded.keys():
    print(f"  - {filename}")

## 5. Download Main Analysis Script

Get the main analysis script from GitHub:

In [None]:
!wget -q https://raw.githubusercontent.com/MK-vet/MKrep/main/Cluster_MIC_AMR_Viruelnce.py

print("✓ Analysis script downloaded")

## 6. Run Analysis

Execute the clustering analysis. This will:
1. Load and validate all data files
2. Perform K-Modes clustering for each data category
3. Calculate statistical associations (Chi-square, Fisher's exact)
4. Compute log-odds ratios with bootstrap confidence intervals
5. Perform Multiple Correspondence Analysis (MCA)
6. Generate feature importance rankings
7. Mine association rules
8. Create comprehensive HTML and Excel reports
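
To make step 4 of the list above concrete, here is a minimal sketch of a log-odds ratio with a percentile bootstrap confidence interval. It is not the downloaded script's code: the function names, the 0.5 continuity correction, the number of resamples, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def log_odds_ratio(feature, in_cluster):
    """Log-odds ratio of feature presence inside vs. outside a cluster.

    Adds 0.5 to every cell (Haldane-Anscombe correction, an assumption here)
    so that empty cells do not produce infinite values.
    """
    feature = np.asarray(feature, dtype=bool)
    in_cluster = np.asarray(in_cluster, dtype=bool)
    a = np.sum(feature & in_cluster) + 0.5      # present, in cluster
    b = np.sum(~feature & in_cluster) + 0.5     # absent, in cluster
    c = np.sum(feature & ~in_cluster) + 0.5     # present, outside cluster
    d = np.sum(~feature & ~in_cluster) + 0.5    # absent, outside cluster
    return float(np.log((a * d) / (b * c)))

def bootstrap_ci(feature, in_cluster, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI, resampling strains with replacement."""
    rng = np.random.default_rng(seed)
    feature = np.asarray(feature)
    in_cluster = np.asarray(in_cluster)
    n = len(feature)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # one bootstrap sample of strain indices
        stats[i] = log_odds_ratio(feature[idx], in_cluster[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Hypothetical toy data: 20 strains, feature more common inside the cluster.
feature = np.array([1] * 8 + [0] * 2 + [1] * 3 + [0] * 7)
in_cluster = np.array([1] * 10 + [0] * 10)

estimate = log_odds_ratio(feature, in_cluster)
ci_low, ci_high = bootstrap_ci(feature, in_cluster)
print(f"log-odds = {estimate:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

A positive estimate whose interval excludes zero suggests enrichment of the feature in the cluster; an interval that spans zero is compatible with no association.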

**Note:** This may take several minutes depending on data size.
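
For step 7, association rules are usually summarized by support, confidence, and lift. The sketch below computes the three metrics for a single rule on a presence/absence table. It only illustrates the definitions and is not the script's mining procedure: `rule_metrics` and the column names are hypothetical.

```python
import pandas as pd

def rule_metrics(df, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    Both columns are expected to hold 0/1 values. The antecedent and the
    consequent must each occur at least once, otherwise the ratios are undefined.
    """
    a = df[antecedent].astype(bool)
    c = df[consequent].astype(bool)
    support = (a & c).mean()          # P(A and C)
    confidence = support / a.mean()   # P(C | A)
    lift = confidence / c.mean()      # P(C | A) / P(C)
    return {
        "support": float(support),
        "confidence": float(confidence),
        "lift": float(lift),
    }

# Hypothetical presence/absence table; column names are illustrative only.
toy = pd.DataFrame({
    "geneA": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "geneB": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

print(rule_metrics(toy, "geneA", "geneB"))
```

A lift above 1 means the two features co-occur more often than expected if they were independent; a lift near 1 means the rule carries little information.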

In [None]:
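# Load and execute the script in the notebook's namespace
with open('Cluster_MIC_AMR_Viruelnce.py') as script:
    exec(script.read())

# Run main analysis.
# Note: __name__ is '__main__' inside a notebook, so if the script has its own
# `if __name__ == '__main__':` guard, main() has already run during exec() above.
if __name__ == '__main__':
    main()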

## 7. Download Results

Download the generated reports and visualizations:

In [None]:
import zipfile

# Create a compressed zip file with all results
zip_filename = "cluster_analysis_results.zip"
with zipfile.ZipFile(zip_filename, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
    # Add all files from output folder
    for root, dirs, files_list in os.walk(output_folder):
        for file in files_list:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, '.')
            zipf.write(file_path, arcname)

print(f"Results packaged in {zip_filename}")

# Download the zip file
files.download(zip_filename)

---

## Analysis Complete!

Your reports include:
- **HTML Report**: Interactive tables with DataTables (sorting, filtering, export)
- **Excel Report**: Multi-sheet workbook with all results and methodology
- **PNG Charts**: High-quality visualizations for publications

### Interpretation Guide:

**Binary Data:**
- 0 = Absence of feature (gene, resistance, virulence factor)
- 1 = Presence of feature
- Both absence and presence are biologically informative, so neither value is treated as missing data

**Key Results:**
- **Clusters**: Groups of strains with similar profiles
- **Chi-square p-values**: Statistical significance of feature-cluster associations (an association is considered significant when p < 0.05 after FDR correction)
- **Log-odds ratios**: Effect size of associations (positive = feature enriched in the cluster, negative = feature depleted in the cluster)
- **Feature importance**: Random Forest-based ranking of discriminative features
- **Association rules**: Co-occurring patterns (support, confidence, lift metrics)
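
To show how FDR-corrected p-values relate to raw ones, the sketch below runs a chi-square test on a few feature-by-cluster tables and adjusts the results. It assumes the Benjamini-Hochberg procedure, since this notebook does not state which FDR method the script applies; the helper `benjamini_hochberg` and the contingency tables are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (one common FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0, 1)
    return out

# Hypothetical 2x2 tables (rows: feature present/absent; columns: in cluster/outside).
tables = {
    "geneA": [[18, 2], [3, 17]],
    "geneB": [[10, 10], [9, 11]],
    "geneC": [[15, 5], [6, 14]],
}

# Index 1 of the chi2_contingency result is the p-value.
raw = [chi2_contingency(t)[1] for t in tables.values()]
adj = benjamini_hochberg(raw)

for name, p_raw, p_adj in zip(tables, raw, adj):
    print(f"{name}: raw p = {p_raw:.4f}, BH-adjusted p = {p_adj:.4f}")
```

Adjusted p-values are never smaller than the raw ones, so a feature that looks significant on its own can lose significance once many features are tested together. The `statsmodels` function `multipletests` with `method='fdr_bh'` performs the same adjustment.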

### Citation:
If you use this analysis in your research, please cite the repository:
```
MK-vet/MKrep: Comprehensive bioinformatics analysis pipeline for microbial genomics
https://github.com/MK-vet/MKrep
```

---