# 🚀 ML Project Framework - Getting Started

A comprehensive guide to using the production-ready ML Project Framework for end-to-end machine learning workflows.

This notebook demonstrates:

- Environment setup and dependency management
- Configuration management with YAML
- Complete data pipeline (load, clean, split)
- Feature engineering and preprocessing
- Model training and hyperparameter tuning
- Comprehensive model evaluation
- Experiment tracking and result management


## 1️⃣ Environment Setup and Virtual Environment Creation

First, let's set up the development environment with project isolation using a virtual environment.


In [None]:
# Display Python version and environment info
import sys
import os

print(f"Python Version: {sys.version}")
print(f"Python Executable: {sys.executable}")
print(f"Current Working Directory: {os.getcwd()}")

### Virtual Environment Setup (Optional - do this in terminal)

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate
```
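
To confirm from inside the notebook that the kernel is running in the virtual environment, one common check compares `sys.prefix` with `sys.base_prefix`. The sketch below assumes an environment created with `python -m venv`; the helper name `in_virtualenv` is illustrative and not part of the framework.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"Running inside a virtual environment: {in_virtualenv()}")
```

Conda environments are not detected by this check, so a `False` result there does not necessarily mean the kernel is unisolated.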


## 2️⃣ Install Dependencies and the Framework Package


In [None]:
# Install required packages
import subprocess
import sys

# Check whether the packages this notebook imports are already installed
try:
    import pandas
    import numpy
    import sklearn
    import matplotlib
    import seaborn
    print("✓ Required packages already installed")
except ImportError:
    print("Installing required packages...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
    print("✓ Packages installed successfully")

In [None]:
# Install the project in editable mode
try:
    import src
    print("✓ ML Project Framework is already installed")
except ImportError:
    print("Installing ML Project Framework...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", "."])
    print("✓ Framework installed successfully")

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from datetime import datetime
import json

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ All libraries imported successfully")

In [None]:
# Import framework modules
from src.utils import get_logger, load_config
from src.data import DataProcessor
from src.features import FeatureEngineer, build_features
from src.models import ModelTrainer
from src.evaluation import evaluate_model, plot_confusion_matrix, plot_feature_importance

print("✓ Framework modules imported successfully")

## 3️⃣ Load and Explore the Configuration

Load and display the YAML configuration file that controls the ML pipeline.


In [None]:
# Initialize logger
logger = get_logger('ml_pipeline', log_dir='logs')
logger.info("Starting ML Pipeline Tutorial")

# Load configuration
config = load_config()
config_dict = config.get_all()

print("✓ Configuration loaded successfully\n")
print("Configuration Overview:")
print("="*60)

In [None]:
# Display configuration sections
print("Project Configuration:")
print(f"  Name: {config.get('project.name')}")
print(f"  Description: {config.get('project.description')}")
print(f"  Version: {config.get('project.version')}")

print("\nProblem Configuration:")
print(f"  Type: {config.get('problem.type')}")
print(f"  Task: {config.get('problem.task')}")

print("\nData Configuration:")
print(f"  Target Variable: {config.get('data.target_variable')}")
print(f"  Test Split: {config.get('data.test_split')}")
print(f"  Validation Split: {config.get('data.validation_split')}")

print("\nModel Configuration:")
print(f"  Algorithm: {config.get('model.algorithm')}")
print(f"  Hyperparameters: {config.get('model.params')}")
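
The calls above read nested settings through dotted keys such as `'data.test_split'`, with an optional default. How the framework's config object resolves them is not shown here; the following is a minimal sketch of one way such a lookup could work over a plain nested dictionary. The function `get_nested` and the `example_config` values are hypothetical.

```python
from typing import Any


def get_nested(config: dict, dotted_key: str, default: Any = None) -> Any:
    """Look up 'a.b.c' in a nested dict, returning default if any part is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        # Stop as soon as the path leaves the dictionary structure.
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Hypothetical configuration, shaped like the keys read above.
example_config = {
    "project": {"name": "demo"},
    "data": {"target_variable": "target", "test_split": 0.2},
}

print(get_nested(example_config, "data.test_split"))
print(get_nested(example_config, "model.algorithm", "random_forest"))
```

Returning a default instead of raising keeps the notebook runnable when a section is absent from the YAML file, which matches how defaults such as `'classification'` are passed in later cells.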

## 4️⃣ Data Pipeline - Loading and Cleaning

Create a sample dataset, inspect it, clean it, and split it into training and test sets.


In [None]:
# Create sample dataset for demonstration
print("📊 Creating sample dataset for demonstration...")

problem_type = config.get('problem.type', 'classification')

if problem_type == 'classification':
    X, y = make_classification(
        n_samples=1000,
        n_features=15,
        n_informative=10,
        n_redundant=3,
        n_classes=2,
        random_state=42
    )
else:
    X, y = make_regression(
        n_samples=1000,
        n_features=15,
        n_informative=10,
        random_state=42
    )

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df[config.get('data.target_variable', 'target')] = y

print(f"✓ Dataset created: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

In [None]:
# Explore dataset characteristics
print("Dataset Statistics:")
print("="*60)

processor = DataProcessor(random_state=42)
summary = processor.get_data_summary(df)

print(f"Shape: {summary['shape']}")
print(f"\nData Types:")
for col, dtype in summary['dtypes'].items():
    print(f"  {col}: {dtype}")

print(f"\nMissing Values:")
if all(v == 0 for v in summary['missing_values'].values()):
    print("  No missing values found ✓")
else:
    for col, count in summary['missing_values'].items():
        if count > 0:
            print(f"  {col}: {count} ({summary['missing_percentage'][col]:.2f}%)")

print(f"\nDuplicate Rows: {summary['duplicates']}")
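
The summary dictionary used above exposes the keys `shape`, `dtypes`, `missing_values`, `missing_percentage`, and `duplicates`. The framework's own implementation is not shown in this notebook; as a rough sketch, the same statistics can be computed with plain pandas. The function name `summarize` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect basic data-quality statistics for a DataFrame."""
    missing = df.isna().sum()  # missing-value count per column
    return {
        "shape": df.shape,
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing_values": {col: int(n) for col, n in missing.items()},
        "missing_percentage": {
            col: float(n) / len(df) * 100 if len(df) else 0.0
            for col, n in missing.items()
        },
        # Rows identical to an earlier row count as duplicates.
        "duplicates": int(df.duplicated().sum()),
    }


toy = pd.DataFrame({"a": [1.0, np.nan, 1.0, 4.0], "b": [2, 3, 2, 5]})
toy_summary = summarize(toy)
print(toy_summary["missing_values"])
print(toy_summary["duplicates"])
```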

In [None]:
# Clean data
print("Data Cleaning Pipeline:")
print("="*60)

# Handle missing values
df_clean = processor.handle_missing_values(df, strategy='drop')
print(f"✓ Missing values handled")

# Remove duplicates
df_clean = processor.remove_duplicates(df_clean)
print(f"✓ Duplicates removed")

# Display cleaned data info
summary_clean = processor.get_data_summary(df_clean)
print(f"\nCleaned Dataset Shape: {summary_clean['shape']}")
print(f"Rows removed: {df.shape[0] - df_clean.shape[0]}")
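
For reference, the two cleaning steps can be approximated with plain pandas. This sketch assumes that `strategy='drop'` means "drop every row containing a missing value", which the notebook does not confirm; the toy data is made up.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({"x": [1.0, np.nan, 1.0, 3.0], "y": [0, 1, 0, 1]})

# Assumed meaning of strategy='drop': remove rows with any missing value.
no_missing = raw.dropna()

# Keep the first occurrence of each duplicated row and renumber the index.
cleaned = no_missing.drop_duplicates().reset_index(drop=True)

print(f"Rows before: {len(raw)}, after: {len(cleaned)}")
```

Dropping rows is the simplest strategy but discards data; on real datasets with many gaps, imputation is often preferable.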

In [None]:
# Data splitting
print("\nData Splitting:")
print("="*60)

target_col = config.get('data.target_variable', 'target')
test_size = config.get('data.test_split', 0.2)

X_train, X_test, y_train, y_test = processor.split_data(
    df_clean,
    target_col=target_col,
    test_size=test_size
)

print(f"Training Set: {X_train.shape[0]} samples ({(1-test_size)*100:.0f}%)")
print(f"Test Set: {X_test.shape[0]} samples ({test_size*100:.0f}%)")
print(f"Features: {X_train.shape[1]}")
print(f"\nTarget Distribution (Training):")
print(y_train.value_counts())
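
The split above is delegated to the framework. A comparable split can be sketched directly with scikit-learn's `train_test_split`; whether `processor.split_data` stratifies by the target is not stated here, so the `stratify` argument below is an assumption added to show how class proportions can be preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)  # imbalanced labels: 20% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # keep the 80/20 class ratio in both subsets
)

print(X_tr.shape, X_te.shape)
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```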

## 5️⃣ Feature Engineering and Preprocessing

Scale the features and, if enabled in the configuration, add polynomial features.


In [None]:
print("Feature Engineering Pipeline:")
print("="*60)

engineer = FeatureEngineer()

# Scale features
scaling_method = config.get('features.scaling.method', 'standard')
X_train_scaled = engineer.scale_features(X_train, method=scaling_method, fit=True)
X_test_scaled = engineer.scale_features(X_test, method=scaling_method, fit=False)

print(f"✓ Features scaled using {scaling_method} scaling")
print(f"  Train shape: {X_train_scaled.shape}")
print(f"  Test shape: {X_test_scaled.shape}")
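
The `fit=True` / `fit=False` pair above reflects a standard rule: scaling statistics are learned from the training set only and then reused on the test set, so no information from the test data leaks into preprocessing. A minimal sketch of that pattern with scikit-learn's `StandardScaler`, assuming `'standard'` scaling corresponds to it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data
test_scaled = scaler.transform(test)        # reuse the training statistics

print("Train mean per feature:", train_scaled.mean(axis=0).round(6))
print("Train std per feature:", train_scaled.std(axis=0).round(6))
print("Test mean per feature:", test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1; the test columns are only approximately standardized, which is expected.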

In [None]:
# Display statistics for the first five features before and after scaling
print("\nFeature Statistics (Before Scaling):")
print(X_train.iloc[:, :5].describe().loc[['mean', 'std', 'min', 'max']])

print("\nFeature Statistics (After Scaling):")
print(X_train_scaled.iloc[:, :5].describe().loc[['mean', 'std', 'min', 'max']])

In [None]:
# Create polynomial features (optional)
use_poly = config.get('features.engineering.polynomial_features.enabled', False)
if use_poly:
    poly_degree = config.get('features.engineering.polynomial_features.degree', 2)
    X_train_eng = engineer.create_polynomial_features(X_train_scaled, degree=poly_degree, fit=True)
    X_test_eng = engineer.create_polynomial_features(X_test_scaled, degree=poly_degree, fit=False)
    print(f"✓ Polynomial features created (degree={poly_degree})")
    print(f"  Train shape: {X_train_eng.shape}")
    print(f"  Test shape: {X_test_eng.shape}")
else:
    X_train_eng = X_train_scaled.copy()
    X_test_eng = X_test_scaled.copy()
    print("✓ Polynomial features disabled")

In [None]:
print("\nFeature Engineering Summary:")
print(f"  Original Features: {X_train.shape[1]}")
print(f"  Engineered Features: {X_train_eng.shape[1]}")
print(f"  Feature Increase: {X_train_eng.shape[1] - X_train.shape[1]}")

## 6️⃣ Model Training with Multiple Algorithms

Train machine learning models using different algorithms and hyperparameters.


In [None]:
# Initialize model trainer
trainer = ModelTrainer(random_state=42)

# Get configuration
algorithm = config.get('model.algorithm', 'random_forest')
model_params = config.get('model.params', {})
problem_type = config.get('problem.type', 'classification')

print("Model Training:")
print("="*60)
print(f"Algorithm: {algorithm}")
print(f"Problem Type: {problem_type}")
print(f"Hyperparameters: {model_params}")
print()

In [None]:
# Train the primary model
trainer.train(
    X_train_eng,
    y_train,
    algorithm=algorithm,
    problem_type=problem_type,
    params=model_params
)

print(f"\n✓ {algorithm} model trained successfully")

In [None]:
# Display model information
print(f"\nModel Details:")
print(f"  Type: {type(trainer.model).__name__}")
print(f"  Training Samples: {trainer.training_history['train_samples']}")
print(f"  Features: {trainer.training_history['train_features']}")
print(f"  Trained At: {trainer.training_history['timestamp']}")

In [None]:
# Train alternative models for comparison
print("\nTraining Alternative Models for Comparison:")
print("="*60)

alternative_algorithms = ['gradient_boosting', 'logistic_regression'] if problem_type == 'classification' else ['gradient_boosting']
alternative_models = {}

for alt_algo in alternative_algorithms:
    try:
        alt_trainer = ModelTrainer(random_state=42)
        alt_trainer.train(
            X_train_eng,
            y_train,
            algorithm=alt_algo,
            problem_type=problem_type,
            params={}
        )
        alternative_models[alt_algo] = alt_trainer
        print(f"✓ {alt_algo} model trained")
    except Exception as e:
        print(f"✗ {alt_algo} model failed: {str(e)}")
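
The trainer selects a model from a string such as `'random_forest'` or `'gradient_boosting'`. One common way to implement that kind of lookup is a registry that maps names to estimator classes. The sketch below is hypothetical (the `CLASSIFIERS` mapping and `build_classifier` helper are not the framework's code) and covers classification only.

```python
from typing import Optional

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical registry: algorithm name -> scikit-learn estimator class.
CLASSIFIERS = {
    "random_forest": RandomForestClassifier,
    "gradient_boosting": GradientBoostingClassifier,
    "logistic_regression": LogisticRegression,
}


def build_classifier(name: str, params: Optional[dict] = None, random_state: int = 42):
    """Instantiate the estimator registered under `name` with optional params."""
    if name not in CLASSIFIERS:
        raise ValueError(f"Unknown algorithm: {name!r}")
    # params must not contain 'random_state', which is passed separately.
    return CLASSIFIERS[name](random_state=random_state, **(params or {}))


X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
for name in CLASSIFIERS:
    model = build_classifier(name).fit(X_demo, y_demo)
    print(f"{name}: train accuracy = {model.score(X_demo, y_demo):.3f}")
```

Raising on an unknown name surfaces configuration typos immediately instead of silently falling back to a default model.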

## 7️⃣ Model Evaluation and Metrics Visualization

Evaluate model performance using comprehensive metrics and visualizations.


In [None]:
# Make predictions
print("Model Evaluation:")
print("="*60)

y_train_pred = trainer.predict(X_train_eng)
y_test_pred = trainer.predict(X_test_eng)

print(f"✓ Predictions generated")
print(f"  Train predictions shape: {y_train_pred.shape}")
print(f"  Test predictions shape: {y_test_pred.shape}")

In [None]:
# Get prediction probabilities for classification
if problem_type == 'classification' and hasattr(trainer.model, 'predict_proba'):
    y_train_proba = trainer.predict_proba(X_train_eng)
    y_test_proba = trainer.predict_proba(X_test_eng)
    print(f"✓ Prediction probabilities generated")
    print(f"  Train probabilities shape: {y_train_proba.shape}")
    print(f"  Test probabilities shape: {y_test_proba.shape}")
else:
    y_train_proba = None
    y_test_proba = None

In [None]:
# Evaluate on training set
print("\nTraining Set Performance:")
train_metrics = evaluate_model(
    y_train,
    y_train_pred,
    problem_type=problem_type,
    y_proba=y_train_proba
)

for key, value in train_metrics.items():
    if key not in ['confusion_matrix', 'classification_report']:
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")

In [None]:
# Evaluate on test set
print("\nTest Set Performance:")
test_metrics = evaluate_model(
    y_test,
    y_test_pred,
    problem_type=problem_type,
    y_proba=y_test_proba
)

for key, value in test_metrics.items():
    if key not in ['confusion_matrix', 'classification_report']:
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
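
The exact metrics returned by `evaluate_model` depend on the framework. For binary classification, a typical set can be computed directly with `sklearn.metrics`, as in this small sketch on made-up labels and scores (the metric names in the dictionary are illustrative, not the framework's keys).

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
# Predicted probability of the positive class, needed for ROC AUC.
y_score = np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3])

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"  {name}: {value:.4f}")
```

When comparing the training and test cells above, a training score far above the test score usually points to overfitting.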

In [None]:
# Visualize confusion matrix
if problem_type == 'classification':
    print("\nConfusion Matrix:")
    cm_data = plot_confusion_matrix(y_test, y_test_pred)
    cm = np.array(cm_data['matrix'])
    
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title('Confusion Matrix - Test Set')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.tight_layout()
    plt.show()
    
    print(f"\n{cm_data['title']}")
    print(cm)

In [None]:
# Feature importance
if hasattr(trainer.model, 'feature_importances_'):
    print("\nFeature Importance Analysis:")
    feature_names = X_train_eng.columns
    importances = trainer.model.feature_importances_
    feature_importance_dict = dict(zip(feature_names, importances))
    
    # Plot top 10 features
    top_n = min(10, len(feature_importance_dict))
    importance_data = plot_feature_importance(feature_importance_dict, top_n=top_n)
    
    plt.figure(figsize=(10, 6))
    plt.barh(importance_data['features'], importance_data['importance'])
    plt.xlabel('Importance')
    plt.title(importance_data['title'])
    plt.tight_layout()
    plt.show()
    
    print(f"\nTop {top_n} Features:")
    for feat, imp in zip(importance_data['features'], importance_data['importance']):
        print(f"  {feat}: {imp:.4f}")
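
Selecting the top features amounts to sorting the name-to-importance mapping in descending order and keeping the first `top_n` entries. A minimal sketch, with a hypothetical helper `top_features` and made-up importance values:

```python
def top_features(importances: dict, top_n: int = 10) -> list:
    """Return (name, importance) pairs sorted from most to least important."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


example = {"feature_0": 0.05, "feature_1": 0.40, "feature_2": 0.25, "feature_3": 0.30}
for name, score in top_features(example, top_n=3):
    print(f"{name}: {score:.2f}")
```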

## 8️⃣ Experiment Tracking and Results Management

Save and manage experiment results, models, and metrics systematically.


In [None]:
# Save model
print("Experiment Management:")
print("="*60)

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
model_path = f'models/final/{algorithm}_model_{timestamp}.pkl'

try:
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    trainer.save_model(model_path)
    print(f"✓ Model saved: {model_path}")
except Exception as e:
    print(f"✗ Failed to save model: {e}")
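
A saved model is only useful if it can be loaded back and still produces the same predictions. The sketch below shows a save/load round trip using the standard-library `pickle` module on a tiny scikit-learn model; it assumes the `.pkl` file written by `trainer.save_model` is a comparable pickle, which this notebook does not confirm.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_model.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)          # serialize the fitted estimator
    with open(path, "rb") as f:
        restored = pickle.load(f)      # load it back from disk

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print("Restored model reproduces predictions:", same_predictions)
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.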

In [None]:
# Save metrics
metrics_path = f'experiments/metrics_{timestamp}.json'

try:
    os.makedirs(os.path.dirname(metrics_path), exist_ok=True)
    
    # Prepare metrics for JSON serialization
    metrics_to_save = {}
    for key, value in test_metrics.items():
        if key not in ['confusion_matrix', 'classification_report']:
            if isinstance(value, (int, float)):
                metrics_to_save[key] = value
            elif hasattr(value, 'tolist'):
                metrics_to_save[key] = value.tolist()
            else:
                metrics_to_save[key] = str(value)
    
    with open(metrics_path, 'w') as f:
        json.dump(metrics_to_save, f, indent=2)
    
    print(f"✓ Metrics saved: {metrics_path}")
except Exception as e:
    print(f"✗ Failed to save metrics: {e}")
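
The loop above converts values by hand because `json` cannot serialize NumPy arrays or most NumPy scalar types. An alternative is to pass a fallback converter through the `default` parameter of `json.dumps`. The helper `to_jsonable` and the sample metrics below are illustrative only.

```python
import json

import numpy as np


def to_jsonable(value):
    """Fallback converter that json.dumps calls for objects it cannot encode."""
    if isinstance(value, np.generic):   # NumPy scalars such as np.int64
        return value.item()
    if isinstance(value, np.ndarray):   # arrays become nested lists
        return value.tolist()
    return str(value)


raw_metrics = {
    "accuracy": np.float64(0.91),
    "support": np.int64(200),
    "per_class": np.array([0.9, 0.92]),
}

encoded = json.dumps(raw_metrics, default=to_jsonable, indent=2)
print(encoded)
```

This keeps numeric values as numbers in the JSON file, whereas a blanket `str(value)` would store them as strings.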

In [None]:
# Save experiment summary
summary_path = f'experiments/experiment_summary_{timestamp}.json'

try:
    experiment_summary = {
        'timestamp': timestamp,
        'algorithm': algorithm,
        'problem_type': problem_type,
        'configuration': {
            'hyperparameters': model_params,
            'train_samples': X_train_eng.shape[0],
            'test_samples': X_test_eng.shape[0],
            'features': X_train_eng.shape[1]
        },
        'performance': {key: value for key, value in test_metrics.items() if isinstance(value, (int, float))},
        'model_path': model_path,
        'metrics_path': metrics_path
    }
    
    with open(summary_path, 'w') as f:
        json.dump(experiment_summary, f, indent=2)
    
    print(f"✓ Experiment summary saved: {summary_path}")
except Exception as e:
    print(f"✗ Failed to save summary: {e}")

## 9️⃣ Running the End-to-End Pipeline

Summarize the steps completed in this notebook. To run the same pipeline outside the notebook with a single command, use `python run_pipeline.py`.


In [None]:
print("\nComplete Pipeline Execution Summary:")
print("="*60)
print(f"\n✓ Steps Completed:")
print(f"  1. Environment setup and dependency installation")
print(f"  2. Configuration loaded from config.yaml")
print(f"  3. Data loaded and cleaned: {df_clean.shape}")
print(f"  4. Train/test split: {X_train_eng.shape[0]}/{X_test_eng.shape[0]}")
print(f"  5. Features engineered: {X_train.shape[1]} → {X_train_eng.shape[1]}")
print(f"  6. Model trained: {algorithm}")
print(f"  7. Performance evaluated")
print(f"  8. Results saved to experiments/")
print(f"  9. Model saved to models/final/")

print(f"\n📊 Final Performance:")
for key, value in test_metrics.items():
    if key not in ['confusion_matrix', 'classification_report']:
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")

In [None]:
print("\n📁 Output Artifacts:")
print(f"  Model: {model_path}")
print(f"  Metrics: {metrics_path}")
print(f"  Summary: {summary_path}")
print(f"  Logs: logs/")

print("\n🎉 Pipeline execution complete!")
print("\n💡 Next Steps:")
print("  1. Modify config.yaml to customize hyperparameters")
print("  2. Add your own data to data/raw/")
print("  3. Experiment with different algorithms")
print("  4. Run: python run_pipeline.py for batch execution")
print("  5. Review results in experiments/ directory")

## 📚 Next Steps

### Customize Your Project

1. **Edit Configuration**: Modify `config/config.yaml` to change algorithms, hyperparameters, and data sources
2. **Add Your Data**: Place your dataset in `data/raw/` and update the configuration
3. **Run Pipeline**: Execute `python run_pipeline.py` for batch processing
4. **Analyze Results**: Check `experiments/` for saved metrics and artifacts

### Key Files to Modify

- `config/config.yaml` - Main configuration file
- `docs/project_requirements.md` - Project documentation template
- `src/features/build_features.py` - Add custom features
- `src/models/train_model.py` - Add new algorithms
- `src/evaluation/evaluate_model.py` - Add custom metrics

### Resources

- See `README.md` for complete documentation
- See `QUICKSTART.md` for quick reference
- Check inline code comments for detailed explanations
- Review `docs/project_requirements.md` for project planning

---

**Good luck with your ML project! 🚀**
