# Medical Device Classification Benchmark Dataset Preparation

This notebook walks through preparing a benchmark dataset for medical device classification and downstream model development.

## Overview
- **Objective**: Create a standardized benchmark dataset for medical device classification
- **Focus**: Ensure data quality, balanced classes, and proper evaluation splits
- **Output**: Clean, well-documented dataset ready for machine learning model development

## Notebook Sections
1. Import Required Libraries
2. Load and Explore Medical Device Dataset
3. Data Preprocessing and Cleaning
4. Feature Engineering for Medical Devices
5. Class Distribution Analysis
6. Data Splitting and Stratification
7. Dataset Validation and Quality Checks
8. Export Benchmark Dataset

## 1. Import Required Libraries

Import the libraries used throughout the notebook: pandas and NumPy for data handling, scikit-learn for preprocessing and splitting, SciPy for statistics, and Matplotlib, seaborn, and Plotly for visualization.

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Data preprocessing and machine learning
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Statistical analysis
from scipy import stats
from collections import Counter

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Configuration
plt.style.use('default')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✅ All libraries imported successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"🎯 Random seed set to: {RANDOM_SEED}")
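The cell above seeds NumPy's global random state. As an optional alternative, a generator local to each function keeps results reproducible even when other code draws from the global state in between. The sketch below is only an illustration: `draw_sample` is a hypothetical helper, not part of this notebook's pipeline.

```python
import numpy as np

RANDOM_SEED = 42  # same value as the notebook's seed


def draw_sample(seed, size=5):
    """Draw values from a generator that is local to this call (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=size)


first = draw_sample(RANDOM_SEED)
# Unrelated draws from the global state do not affect the local generator
np.random.normal(size=1000)
second = draw_sample(RANDOM_SEED)

print(np.array_equal(first, second))  # True: same seed, same local stream
```

Passing the seed (or the generator itself) into data-generating functions makes each function's output depend only on its arguments.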

## 2. Load and Explore Medical Device Dataset

Load the medical device data (a synthetic sample here, standing in for your raw data) and run an initial exploration of its structure, features, and target classes.
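If your real data lives in a CSV file, the loading step might look like the following sketch. The function name `load_medical_device_data`, the `EXPECTED_COLUMNS` list, and the choice of CSV as the storage format are assumptions for illustration; the column names mirror the synthetic sample created in this section.

```python
from pathlib import Path

import pandas as pd

# Assumed schema: mirrors the columns of this notebook's synthetic sample
EXPECTED_COLUMNS = [
    "device_id", "device_name", "category", "manufacturer",
    "power_rating_watts", "weight_kg", "price_usd",
    "fda_approved", "year_manufactured", "maintenance_required",
]


def load_medical_device_data(csv_path):
    """Load device records from a CSV file and verify the expected columns exist."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"No dataset found at {path}")

    df = pd.read_csv(path)

    # Fail early if the file does not match the assumed schema
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing expected columns: {missing}")
    return df
```

A call such as `df = load_medical_device_data("devices.csv")` (a placeholder path) would then stand in for the synthetic generator. Checking the schema at load time surfaces column mismatches before they cause harder-to-trace errors in later sections.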

In [None]:
# Create sample medical device dataset for demonstration
# In practice, replace this with your actual data loading code

def create_sample_medical_device_data(n_samples=1000):
    """
    Create a sample medical device dataset for demonstration purposes.
    Replace this with your actual data loading function.
    """
    
    # Define medical device categories
    categories = [
        'Diagnostic Imaging', 'Surgical Instruments', 'Monitoring Equipment',
        'Therapeutic Devices', 'Prosthetics', 'Implants', 'Laboratory Equipment',
        'Emergency Equipment', 'Rehabilitation Devices', 'Dental Equipment'
    ]
    
    # Define manufacturers
    manufacturers = ['MedTech Corp', 'HealthDevices Inc', 'BioMed Solutions', 
                    'MediCare Systems', 'TechHealth Ltd', 'Advanced Medical']
    
    # Generate synthetic data
    np.random.seed(RANDOM_SEED)
    data = []
    
    for i in range(n_samples):
        category = np.random.choice(categories)
        manufacturer = np.random.choice(manufacturers)
        
        # Draw synthetic feature values from category-specific normal
        # distributions (illustrative parameters, not real device statistics)
        if category == 'Diagnostic Imaging':
            power_rating = np.random.normal(150, 30)
            weight = np.random.normal(500, 100)
            price = np.random.normal(50000, 15000)
        elif category == 'Surgical Instruments':
            power_rating = np.random.normal(20, 5)
            weight = np.random.normal(2, 0.5)
            price = np.random.normal(5000, 2000)
        elif category == 'Monitoring Equipment':
            power_rating = np.random.normal(50, 10)
            weight = np.random.normal(10, 3)
            price = np.random.normal(15000, 5000)
        else:
            power_rating = np.random.normal(75, 25)
            weight = np.random.normal(50, 20)
            price = np.random.normal(20000, 8000)
        
        # Clip power and weight to be non-negative; floor price at 1,000 USD
        power_rating = max(0, power_rating)
        weight = max(0, weight)
        price = max(1000, price)
        
        data.append({
            'device_id': f'DEV_{i:05d}',
            'device_name': f'{category} Device {i}',
            'category': category,
            'manufacturer': manufacturer,
            'power_rating_watts': round(power_rating, 2),
            'weight_kg': round(weight, 2),
            'price_usd': round(price, 2),
            'fda_approved': np.random.choice([True, False], p=[0.8, 0.2]),
            'year_manufactured': np.random.randint(2015, 2024),  # 2015-2023; upper bound is exclusive
            'maintenance_required': np.random.choice(['Low', 'Medium', 'High'], p=[0.5, 0.3, 0.2])
        })
    
    return pd.DataFrame(data)

# Load or create dataset
print("📂 Loading medical device dataset...")
df = create_sample_medical_device_data(n_samples=1000)

print("✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"🏷️ Columns: {list(df.columns)}")

# Display first few rows
print("\n🔍 First 5 rows:")
df.head()
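Beyond the first rows, initial exploration usually also covers column types, missing values, and how the target classes are distributed. The sketch below shows one way to collect these in a single call. `summarize_dataset` is a hypothetical helper, and the small `demo` frame is made-up stand-in data so the block runs on its own; in the notebook you would pass `df` instead.

```python
import pandas as pd


def summarize_dataset(df, target_col="category"):
    """Return per-column structure info and the class distribution of the target."""
    column_summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })

    counts = df[target_col].value_counts()
    class_distribution = pd.DataFrame({
        "count": counts,
        "share": (counts / len(df)).round(3),
    })
    return column_summary, class_distribution


# Tiny stand-in frame (made-up values) so the sketch is self-contained
demo = pd.DataFrame({
    "category": ["Implants", "Implants", "Prosthetics", "Implants"],
    "weight_kg": [1.2, None, 3.4, 0.8],
})

column_summary, class_distribution = summarize_dataset(demo)
print(column_summary)
print(class_distribution)
```

The `share` column gives an early read on class imbalance, which matters later when choosing a stratified split.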