# 🧬 MKrep: Interactive Microbial Genomics Analysis

**No coding required! Just upload your data and click buttons to analyze.**

This notebook provides an interactive, user-friendly interface for analyzing bacterial genomic data:
- **Cluster Analysis**: Discover patterns in MIC, AMR, and virulence data
- **MDR Analysis**: Identify multi-drug resistance patterns
- **Network Analysis**: Explore relationships between genetic features
- **Phylogenetic Analysis**: Study evolutionary relationships

---

## 🚀 How to Use This Notebook:

1. **Run Setup** (Sections 1-2): Click the play button (▶) on the first few cells
2. **Upload Your Data** (Section 3): Use the file upload button to select your CSV files
3. **Select Analysis Type** (Section 4): Choose which analysis you want to run
4. **Configure Parameters** (Section 5): Adjust settings using sliders and dropdowns
5. **Run Analysis** (Section 6): Click the "Run Analysis" button
6. **View Results** (Section 7): See visualizations and download reports

---

### 📋 Data Format Requirements:

Your CSV files should have:
- First column: `Strain_ID` (unique identifier)
- Other columns: Binary features (0 = absent, 1 = present)

**Example:**
```
Strain_ID,Gene1,Gene2,Antibiotic1
Strain001,1,0,1
Strain002,0,1,0
```
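
The format rules above can be checked before uploading. The following is a minimal sketch, not part of MKrep itself: the function name `check_binary_csv` is hypothetical, and it assumes every column after `Strain_ID` must contain only `0` or `1`. It uses only the Python standard library.

```python
import csv
import io

def check_binary_csv(text, id_column="Strain_ID"):
    """Return a list of human-readable problems found in CSV text."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header:
        return ["file is empty"]
    if header[0].strip() != id_column:
        problems.append(f"first column is '{header[0]}', expected '{id_column}'")
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        strain = row[0].strip()
        if strain in seen_ids:
            problems.append(f"line {line_no}: duplicate ID '{strain}'")
        seen_ids.add(strain)
        # Every remaining cell is expected to be the string "0" or "1"
        for name, value in zip(header[1:], row[1:]):
            if value.strip() not in ("0", "1"):
                problems.append(
                    f"line {line_no}: column '{name}' has non-binary value '{value}'"
                )
    return problems

sample = "Strain_ID,Gene1,Gene2,Antibiotic1\nStrain001,1,0,1\nStrain002,0,1,0\n"
print(check_binary_csv(sample))  # an empty list means no problems were found
```

To check a file on disk, read it first, for example `check_binary_csv(open("MIC.csv").read())`.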


## 1. Setup and Installation

**Click the ▶ button to run this cell** (this will take 1-2 minutes)

In [None]:
#@title 📦 Install Required Packages { display-mode: "form" }

import sys
import subprocess
import os

print("🔧 Installing dependencies...")
print("This may take 1-2 minutes. Please wait...\n")

# Install required packages
packages = [
    "pandas>=1.3.0",
    "numpy>=1.21.0",
    "scipy>=1.7.0",
    "matplotlib>=3.4.0",
    "seaborn>=0.11.0",
    "plotly>=5.0.0",
    "scikit-learn>=1.0.0",
    "biopython>=1.79",
    "networkx>=2.6.0",
    "openpyxl>=3.0.0",
    "kaleido>=0.2.0",
    "ipywidgets>=8.0.0",
    "kmodes>=0.12.0",
    "mlxtend>=0.19.0",
    "prince>=0.7.0",
    "umap-learn>=0.5.0"
]

subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + packages)

print("✅ All packages installed successfully!\n")
print("Now proceed to the next cell to load utilities.")

## 2. Load Analysis Tools

**Click the ▶ button to run this cell**

In [None]:
#@title 📚 Load Libraries and Download Scripts { display-mode: "form" }

import warnings
warnings.filterwarnings('ignore')

import io
import base64
import zipfile
from datetime import datetime
import gc
import logging
import json
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output, Image, Javascript

print("📥 Downloading analysis scripts from GitHub...\n")

# Download required scripts from GitHub
repo_base = "https://raw.githubusercontent.com/MK-vet/MKrep/main/"
scripts_to_download = [
    "excel_report_utils.py",
    "report_templates.py",
    "Cluster_MIC_AMR_Viruelnce.py",
    "MDR_2025_04_15.py",
    "Network_Analysis_2025_06_26.py",
    "Phylgenetic_clustering_2025_03_21.py",
    "StrepSuisPhyloCluster_2025_08_11.py"
]

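for script in scripts_to_download:
    try:
        subprocess.check_call(["wget", "-q", "-O", script, repo_base + script])
        print(f"✓ Downloaded {script}")
    except (subprocess.CalledProcessError, OSError):
        # wget exits non-zero on HTTP errors; OSError covers wget not being installed
        print(f"⚠ Could not download {script} (might not exist)")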

# Create output directory
output_dir = "analysis_results"
os.makedirs(output_dir, exist_ok=True)

print("\n✅ All tools loaded successfully!")
print("Now proceed to upload your data files.")

## 3. Upload Your Data Files

**Click the ▶ button and then click "Choose Files" to upload your CSV files**

In [None]:
#@title 📂 Upload Data Files { display-mode: "form" }

from google.colab import files

# Global storage for uploaded files
uploaded_files = {}
file_dataframes = {}

# File upload widget
print("📤 Please upload your CSV files:")
print("\nSupported files:")
print("  • MIC.csv - Minimum Inhibitory Concentration data")
print("  • AMR_genes.csv - Antimicrobial resistance genes")
print("  • Virulence.csv - Virulence factors")
print("  • MLST.csv - Multi-locus sequence typing")
print("  • Serotype.csv - Serological types")
print("  • Plasmid.csv - Plasmid data")
print("  • MGE.csv - Mobile genetic elements")
print("  • tree.newick - Phylogenetic tree (for phylo analysis)")
print("\n⚠️ All CSV files must have a 'Strain_ID' column!")
print("⚠️ Data should be binary: 0 = absent, 1 = present\n")

uploaded = files.upload()

print("\n📊 Processing uploaded files...\n")

for filename, content in uploaded.items():
    uploaded_files[filename] = content
    
    # Try to read as dataframe if CSV
    if filename.endswith('.csv'):
        try:
            df = pd.read_csv(io.BytesIO(content))
            file_dataframes[filename] = df
            print(f"✓ {filename}: {len(df)} rows, {len(df.columns)} columns")

            # Check for Strain_ID column
            if 'Strain_ID' not in df.columns:
                print("  ⚠️ Warning: Missing 'Strain_ID' column!")
        except Exception as e:
            print(f"✗ {filename}: Error reading file - {str(e)}")
    else:
        print(f"✓ {filename}: Uploaded (non-CSV file)")

print(f"\n✅ Successfully uploaded {len(uploaded_files)} file(s)")
print("Now proceed to select your analysis type.")

## 4. Select Analysis Type

**Click the ▶ button and choose your analysis from the dropdown**

In [None]:
#@title 🔬 Choose Analysis Type { display-mode: "form" }

# Analysis configuration
analysis_config = {
    'cluster': {
        'name': 'Cluster Analysis',
        'description': 'Discover patterns using K-Modes clustering',
        'required_files': ['MIC.csv', 'AMR_genes.csv', 'Virulence.csv'],
        'optional_files': ['MLST.csv', 'Serotype.csv', 'Plasmid.csv', 'MGE.csv']
    },
    'mdr': {
        'name': 'MDR Analysis',
        'description': 'Analyze multi-drug resistance patterns',
        'required_files': ['MIC.csv', 'AMR_genes.csv'],
        'optional_files': ['Virulence.csv', 'MLST.csv']
    },
    'network': {
        'name': 'Network Analysis',
        'description': 'Explore feature associations and networks',
        'required_files': ['MGE.csv', 'MIC.csv', 'MLST.csv', 'Plasmid.csv', 'Serotype.csv', 'Virulence.csv', 'AMR_genes.csv'],
        'optional_files': []
    },
    'phylo': {
        'name': 'Phylogenetic Clustering',
        'description': 'Tree-based clustering with evolutionary metrics',
        'required_files': ['tree.newick', 'MIC.csv', 'AMR_genes.csv', 'Virulence.csv'],
        'optional_files': ['MLST.csv', 'Serotype.csv']
    },
    'strepsuis': {
        'name': 'Streptococcus suis Analysis',
        'description': 'Specialized analysis for S. suis',
        'required_files': ['tree.newick', 'MIC.csv', 'AMR_genes.csv', 'Virulence.csv'],
        'optional_files': ['MLST.csv', 'Serotype.csv']
    }
}

# Store selected analysis globally
selected_analysis_type = None

# Create dropdown widget
analysis_dropdown = widgets.Dropdown(
    options=[(config['name'], key) for key, config in analysis_config.items()],
    description='Analysis:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='500px')
)

# Info display
info_output = widgets.Output()

def on_analysis_change(change):
    global selected_analysis_type
    selected_analysis_type = change['new']
    config = analysis_config[selected_analysis_type]
    
    with info_output:
        clear_output()
        print(f"\n📋 {config['name']}")
        print(f"   {config['description']}\n")
        print("Required files:")
        for f in config['required_files']:
            status = "✓" if f in uploaded_files else "✗"
            print(f"  {status} {f}")
        if config['optional_files']:
            print("\nOptional files:")
            for f in config['optional_files']:
                status = "✓" if f in uploaded_files else "○"
                print(f"  {status} {f}")
        
        # Check if can run
        missing = [f for f in config['required_files'] if f not in uploaded_files]
        if missing:
            print(f"\n⚠️ Missing required files: {', '.join(missing)}")
            print("Please go back and upload the required files.")
        else:
            print("\n✅ All required files uploaded!")
            print("Proceed to configure parameters.")

analysis_dropdown.observe(on_analysis_change, names='value')

# Display widgets
display(analysis_dropdown)
display(info_output)

# Trigger initial display
if analysis_dropdown.value:
    on_analysis_change({'new': analysis_dropdown.value})

## 5. Configure Analysis Parameters

**Click the ▶ button and adjust parameters using the controls below**
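
To give a feel for what the "Bootstrap Iterations" and "Random Seed" controls mean, here is a minimal, generic sketch of a percentile bootstrap confidence interval for how often a binary feature is present. It is an illustration only and makes no claim about how the MKrep scripts compute their statistics; the function name `bootstrap_prevalence_ci`, the `ci_level` parameter, and the sample data are all hypothetical.

```python
import random

def bootstrap_prevalence_ci(values, n_iterations=500, seed=42, ci_level=0.95):
    """Percentile bootstrap confidence interval for the mean of a 0/1 list."""
    rng = random.Random(seed)  # a fixed seed gives the same resamples on every run
    n = len(values)
    means = []
    for _ in range(n_iterations):
        # Resample n observations with replacement and record the prevalence
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    tail = (1 - ci_level) / 2
    lower = means[int(tail * n_iterations)]
    upper = means[min(n_iterations - 1, int((1 - tail) * n_iterations))]
    return lower, upper

gene_present = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical presence/absence data
print(bootstrap_prevalence_ci(gene_present))
```

More iterations give a more stable interval at the cost of run time, and keeping the seed fixed makes repeated runs return identical numbers.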

In [None]:
#@title ⚙️ Analysis Parameters { display-mode: "form" }

# Create parameter widgets
max_clusters_slider = widgets.IntSlider(
    value=8,
    min=2,
    max=15,
    step=1,
    description='Max Clusters:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='500px')
)

bootstrap_slider = widgets.IntSlider(
    value=500,
    min=100,
    max=2000,
    step=100,
    description='Bootstrap Iterations:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='500px')
)

mdr_threshold_slider = widgets.IntSlider(
    value=3,
    min=2,
    max=10,
    step=1,
    description='MDR Threshold:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='500px')
)

fdr_alpha_slider = widgets.FloatSlider(
    value=0.05,
    min=0.01,
    max=0.2,
    step=0.01,
    description='FDR Alpha:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='500px')
)

random_seed_input = widgets.IntText(
    value=42,
    description='Random Seed:',
    style={'description_width': '150px'},
    layout=widgets.Layout(width='300px')
)

# Display widgets
print("Configure your analysis parameters:\n")
display(max_clusters_slider)
display(bootstrap_slider)
display(mdr_threshold_slider)
display(fdr_alpha_slider)
display(random_seed_input)

print("\n💡 Tip: Drag each slider to change its value")
print("Now proceed to run the analysis!")

## 6. Run Analysis

**Click the ▶ button to start the analysis**

This may take several minutes depending on the size of your data and the selected parameters.

In [None]:
#@title 🚀 Run Analysis { display-mode: "form" }

import time
import traceback

# Progress display
progress_output = widgets.Output()
progress_bar = widgets.IntProgress(
    value=0,
    min=0,
    max=100,
    description='Progress:',
    bar_style='info',
    style={'bar_color': '#0066cc'},
    layout=widgets.Layout(width='100%')
)

status_label = widgets.Label(value="Ready to start...")

display(status_label)
display(progress_bar)
display(progress_output)

# Check if analysis is selected and files are uploaded
if not selected_analysis_type:
    with progress_output:
        print("⚠️ Please select an analysis type in the previous section")
else:
    config = analysis_config[selected_analysis_type]
    missing = [f for f in config['required_files'] if f not in uploaded_files]
    
    if missing:
        with progress_output:
            print(f"⚠️ Missing required files: {', '.join(missing)}")
            print("Please go back and upload the required files.")
    else:
        # Run analysis
        with progress_output:
            print(f"\n🔬 Starting {config['name']}...\n")
            status_label.value = "Initializing..."
            progress_bar.value = 10
            
            try:
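                # Set random seed. Note: this seeds NumPy in the notebook process only;
                # the analysis script below runs in a separate process with its own state.
                np.random.seed(random_seed_input.value)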
                
                # Create timestamp for outputs
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                analysis_output_dir = f"{output_dir}/{selected_analysis_type}_{timestamp}"
                os.makedirs(analysis_output_dir, exist_ok=True)
                
                status_label.value = "Loading data..."
                progress_bar.value = 20
                print("📂 Loading data files...")

                # Save uploaded files to disk for analysis scripts
                for filename, content in uploaded_files.items():
                    with open(filename, 'wb') as f:
                        f.write(content)
                    print(f"  ✓ {filename}")

                status_label.value = "Running analysis..."
                progress_bar.value = 30
                print("\n🔄 Executing analysis (this may take several minutes)...\n")
                
                # Map each analysis type to the script that implements it
                script_by_analysis = {
                    'cluster': 'Cluster_MIC_AMR_Viruelnce.py',
                    'mdr': 'MDR_2025_04_15.py',
                    'network': 'Network_Analysis_2025_06_26.py',
                    'phylo': 'Phylgenetic_clustering_2025_03_21.py',
                    'strepsuis': 'StrepSuisPhyloCluster_2025_08_11.py',
                }
                script_name = script_by_analysis[selected_analysis_type]
                
                # Execute the script
                result = subprocess.run(
                    [sys.executable, script_name],
                    capture_output=True,
                    text=True,
                    timeout=1800  # 30 minutes timeout
                )
                
                progress_bar.value = 80
                status_label.value = "Processing results..."
                
                if result.returncode == 0:
                    print("✅ Analysis completed successfully!\n")
                    progress_bar.value = 100
                    progress_bar.bar_style = 'success'
                    status_label.value = "Analysis complete!"

                    # Show the tail of the script output
                    if result.stdout:
                        print("📊 Analysis output:")
                        print(result.stdout[-500:])

                    print("\n✨ Results are ready! Proceed to the next section to view and download.")
                else:
                    # A non-zero exit code means the script failed, not merely warned
                    print(f"❌ Analysis script exited with code {result.returncode}.")
                    print(f"\nOutput: {result.stdout[-500:]}")
                    print(f"\nErrors: {result.stderr[-500:]}")
                    progress_bar.bar_style = 'danger'
                    status_label.value = f"Failed (exit code {result.returncode})"
                    
            except subprocess.TimeoutExpired:
                print("⚠️ Analysis timed out after 30 minutes.")
                print("Try reducing the dataset size or bootstrap iterations.")
                progress_bar.bar_style = 'danger'
                status_label.value = "Timeout"
                
            except Exception as e:
                print(f"❌ Error during analysis: {str(e)}")
                print("\nFull error:")
                traceback.print_exc()
                progress_bar.bar_style = 'danger'
                status_label.value = "Error"


## 7. View and Download Results

**Click the ▶ button to view your results and download reports**

In [None]:
#@title 📊 View Results { display-mode: "form" }

results_output = widgets.Output()

with results_output:
    print("\n📁 Generated Files:\n")
    
    # Find generated files
    html_files = [f for f in os.listdir('.') if f.endswith('.html')]
    xlsx_files = [f for f in os.listdir('.') if f.endswith('.xlsx')]
    png_dir = 'png_charts'
    
    if html_files:
        print("🌐 HTML Reports:")
        for f in html_files:
            size = os.path.getsize(f) / 1024
            print(f"  • {f} ({size:.1f} KB)")

    if xlsx_files:
        print("\n📊 Excel Reports:")
        for f in xlsx_files:
            size = os.path.getsize(f) / 1024
            print(f"  • {f} ({size:.1f} KB)")

    if os.path.exists(png_dir):
        png_files = [f for f in os.listdir(png_dir) if f.endswith('.png')]
        if png_files:
            print(f"\n🖼️ Visualizations: {len(png_files)} charts in '{png_dir}/' directory")

    if not html_files and not xlsx_files:
        print("⚠️ No output files found. Please run the analysis first.")
    else:
        print("\n✅ All results generated successfully!")
        print("\nUse the cell below to download files.")

display(results_output)

In [None]:
#@title 📥 Download Results { display-mode: "form" }

from google.colab import files as colab_files

download_output = widgets.Output()

with download_output:
    print("📦 Preparing results for download...\n")
    
    # Find all result files
    html_files = [f for f in os.listdir('.') if f.endswith('.html')]
    xlsx_files = [f for f in os.listdir('.') if f.endswith('.xlsx')]
    
    if not html_files and not xlsx_files:
        print("⚠️ No results to download. Please run the analysis first.")
    else:
        # Create a zip file with all results
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        zip_filename = f"mkrep_results_{timestamp}.zip"
        
        with zipfile.ZipFile(zip_filename, 'w') as zipf:
            # Add HTML files
            for f in html_files:
                zipf.write(f)
                print(f"  ✓ Added {f}")

            # Add Excel files
            for f in xlsx_files:
                zipf.write(f)
                print(f"  ✓ Added {f}")

            # Add PNG charts if they exist
            png_dir = 'png_charts'
            if os.path.exists(png_dir):
                # Use names that do not shadow the `files` module imported from google.colab
                for root, _dirs, filenames in os.walk(png_dir):
                    for name in filenames:
                        if name.endswith('.png'):
                            file_path = os.path.join(root, name)
                            arcname = os.path.join('png_charts', name)
                            zipf.write(file_path, arcname=arcname)
                print("  ✓ Added PNG charts directory")
        
        print(f"\n✅ Created {zip_filename}")
        print(f"   Size: {os.path.getsize(zip_filename) / (1024*1024):.2f} MB")
        print("\n📥 Downloading...")

        # Download the zip file
        colab_files.download(zip_filename)

        print("\n✨ Download complete!")
        print("Check your browser's download folder.")

display(download_output)

---

## 🎉 Analysis Complete!

### What's Next?

1. **Review the HTML report** - Open the HTML file in your browser for interactive tables and charts
2. **Explore the Excel workbook** - Open the Excel file for detailed data tables and methodology
3. **View visualizations** - Check the PNG charts folder for high-quality publication-ready figures
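
Before opening individual files, it can help to confirm what the downloaded archive contains. The following is a small sketch using only the standard library; the helper name `summarize_archive` is hypothetical, and it assumes the archive follows the `mkrep_results_<timestamp>.zip` naming used by the download cell.

```python
import zipfile
from collections import Counter
from pathlib import Path

def summarize_archive(zip_path):
    """Count the files in a zip archive, grouped by file extension."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries, which end with a slash
        names = [n for n in archive.namelist() if not n.endswith("/")]
    return Counter(Path(n).suffix.lower() or "(none)" for n in names)

# Summarize every MKrep results archive found in the current directory
for path in sorted(Path(".").glob("mkrep_results_*.zip")):
    print(path.name, dict(summarize_archive(path)))
```

If no matching archive is present, the loop simply prints nothing.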

### Need Help?

- 📚 [User Guide](https://github.com/MK-vet/MKrep/blob/main/USER_GUIDE.md)
- 📖 [Interpretation Guide](https://github.com/MK-vet/MKrep/blob/main/INTERPRETATION_GUIDE.md)
- 🐛 [Report Issues](https://github.com/MK-vet/MKrep/issues)

### Run Another Analysis

To run a different analysis:
1. Go back to Section 3 to upload new files (if needed)
2. Select a different analysis type in Section 4
3. Configure parameters in Section 5
4. Run the analysis in Section 6
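
The view and download cells list every `.html` and `.xlsx` file in the working directory, so reports from an earlier run would be picked up again alongside the new ones. One way to avoid mixing runs is to move old reports aside first. This is a hedged sketch, not part of MKrep: the function name `archive_previous_reports` and the `previous_runs` folder are assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def archive_previous_reports(workdir=".", archive_root="previous_runs"):
    """Move existing .html/.xlsx reports and png_charts/ into a timestamped folder.

    Returns the folder the items were moved to, or None if there was nothing to move.
    """
    workdir = Path(workdir)
    to_move = [p for p in workdir.iterdir()
               if p.is_file() and p.suffix.lower() in {".html", ".xlsx"}]
    charts = workdir / "png_charts"
    if charts.is_dir():
        to_move.append(charts)
    if not to_move:
        return None
    target = workdir / archive_root / datetime.now().strftime("%Y%m%d_%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    for item in to_move:
        shutil.move(str(item), str(target / item.name))
    return target

# Demonstration in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as demo:
    Path(demo, "old_report.html").write_text("<html></html>")
    moved_to = archive_previous_reports(demo)
    print(sorted(p.name for p in moved_to.iterdir()))
```

In the notebook itself, calling `archive_previous_reports()` with no arguments before Section 6 would act on the current working directory. Uploaded CSV and tree files are left in place.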

---

**MKrep** - Comprehensive Microbial Genomics Analysis Pipeline  
© 2025 | [GitHub Repository](https://github.com/MK-vet/MKrep) | [MIT License](https://github.com/MK-vet/MKrep/blob/main/LICENSE)
