## Assignment 4: Applying AutoML with Ludwig on MNIST Dataset
  
**DONG**

### Part 1: Conceptual Questions

##### 1. Describing AutoML

Automated Machine Learning (AutoML) automates the end-to-end steps of applying machine learning to real-world problems, including feature engineering, model selection, hyperparameter optimization, and iterative evaluation. Its primary benefits are a sharp reduction in the time and expertise needed to reach a strong baseline model, broader access to ML for non-specialists, and faster experimentation. Its limitations include the computational cost of searching large model spaces, reduced transparency and control over the final pipeline, and the inability of current tools to automate problem framing or the application of expert domain knowledge.
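
As a rough illustration of the hyperparameter-optimization step, the sketch below runs a plain random search. It is not how any particular AutoML tool is implemented; `random_search` and `toy_validation_loss` are hypothetical names, and the toy objective stands in for "train a model and return its validation loss".

```python
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Return (best_params, best_score, history) after n_trials random samples.

    `space` maps each hyperparameter name to a (low, high) range.
    Lower scores are treated as better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    history = []
    for _ in range(n_trials):
        # Sample one candidate configuration uniformly from the search space.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        history.append((params, score))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, history


def toy_validation_loss(params):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (params["learning_rate"] - 0.01) ** 2 + (params["dropout"] - 0.2) ** 2


space = {"learning_rate": (0.0001, 0.1), "dropout": (0.0, 0.5)}
best_params, best_score, history = random_search(toy_validation_loss, space)
print(f"best loss {best_score:.6f} with {best_params}")
```

Each trial here costs one function call; in a real search each trial is a full training run, which is where the computational cost mentioned above comes from.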



##### 2. Explaining Ludwig's Role

Ludwig simplifies the creation of machine learning models by providing a declarative, low-code approach: users describe the task in a YAML configuration file that lists the input data columns and the target output columns. Ludwig then builds the model architecture, runs training, and manages hyperparameters automatically. Early releases were built on TensorFlow; current releases, including the 0.10 line installed in this notebook, are built on PyTorch. Ludwig is designed to address a wide range of supervised learning problems, including regression, classification, sequence tagging, machine translation, and multimodal tasks that combine text, images, and numerical data.



##### 3. Discussing Data Preparation in AutoML

Data preparation is critically important in an AutoML workflow because the quality and structure of the input data directly constrain the performance of any automatically optimized model. Ludwig handles data preprocessing automatically by mapping data types specified in the configuration (e.g., text, image, numerical) to the appropriate encoders and preprocessors (e.g., tokenizers for text, normalization for numbers). This automatic preprocessing, which is defined implicitly by the input and output types chosen by the user, eliminates the manual coding typically required for feature engineering and preparation, ensuring data is transformed into a format consumable by neural network models.
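
To make "normalization for numbers" concrete, here is a minimal sketch of the two schemes named in the configurations later in this notebook, min-max and z-score, written in plain Python. It illustrates the arithmetic only and is not Ludwig's implementation; the function names are hypothetical, and the population standard deviation is an assumption made for simplicity.

```python
def minmax_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    if std == 0:
        # Guard against division by zero for constant columns.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


pixels = [0, 51, 102, 255]  # example intensities for one pixel column
print("min-max:", minmax_normalize(pixels))
print("z-score:", zscore_normalize(pixels))
```

The zero-variance guard matters for MNIST in particular, because some border pixels hold the same value in every image.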



##### 4. Describing Model Serving in AutoML

Model serving is the process of deploying a trained ML model into a production environment where it can receive live data inputs (inference requests) and return predictions with low latency. Ludwig supports model serving in two ways. The `ludwig serve` command starts a REST server around a trained model, and trained models can also be exported to a standardized format for use in other serving infrastructure. The export format depends on the Ludwig version: the TensorFlow-based releases exported the SavedModel format, which TensorFlow Serving can expose through a REST or gRPC endpoint, while the PyTorch-based releases export TorchScript.
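
A client of such a REST endpoint might look roughly like the sketch below, which builds (but does not send) a prediction request for one MNIST image using only the standard library. The base URL, the `/predict` path, and the form encoding are assumptions to check against the documentation of the server actually deployed; `build_predict_request` is a hypothetical helper, and the `pixel_{i}` field names mirror the column names used later in this notebook.

```python
import urllib.parse
import urllib.request


def build_predict_request(pixels, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request carrying one flattened 28x28 image."""
    if len(pixels) != 784:
        raise ValueError(f"expected 784 pixel values, got {len(pixels)}")
    # One form field per input feature, named like the training columns.
    form = {f"pixel_{i}": str(value) for i, value in enumerate(pixels)}
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",  # assumed endpoint path
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_predict_request([0] * 784)
print(request.get_method(), request.full_url, len(request.data), "bytes")

# To send it once a server is running (not executed here):
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode("utf-8"))
```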



##### 5. Predicting the Future of Open-Source AutoML Tools

The future of open-source AutoML tools like Ludwig is bright, driven by increasing community contributions and transparency. Compared to proprietary tools, open-source platforms offer superior accessibility (free to use, with code open to inspection) and community support (rapid bug fixes, diverse integration possibilities, and extensive documentation). While proprietary tools may currently lead in certain advanced algorithms or offer more tightly integrated cloud infrastructure, open-source tools are likely to keep closing the functionality gap, focusing on better interoperability, multimodal support, and automated hyperparameter optimization, and may ultimately become the standard choice for researchers and small-to-midsize enterprises.

In [None]:
"""
MNIST Handwritten Digit Classification with Ludwig
===================================================

This script demonstrates how to use Ludwig to build, train, and evaluate
a deep learning model for the MNIST dataset.

Setup Instructions:
------------------
1. Create a Python 3.8+ virtual environment. Avoid very new interpreters:
   on Python 3.13 pip had to build pandas<2.2.0 from source, which failed here.
   python3 -m venv ludwig_env
   source ludwig_env/bin/activate  # On Windows: ludwig_env\Scripts\activate

2. Install Ludwig with image support:
   pip install ludwig[full]
   # Or minimal: pip install ludwig

3. Download the MNIST CSV files to your working directory

Usage:
------
python mnist_ludwig.py
"""

In [21]:
# !python3 -m venv ludwig_env
# !ludwig_env\Scripts\activate
!pip install ludwig
# !pip install torch torchvision torchaudio
# !pip install jupyterlab-server jupyter-events --upgrade
# !ludwig train --config_file config.yaml --dataset data.csv
# !pip uninstall jsonschema
# !pip install jsonschema==4.6.2

# !pip install fastapi uvicorn httpx
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Collecting ludwig
  Using cached ludwig-0.10.4.tar.gz (1.1 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting Cython>=0.25 (from ludwig)
  Using cached cython-3.2.2-cp313-cp313-win_amd64.whl.metadata (5.1 kB)
Collecting pandas!=1.1.5,<2.2.0,>=1.0 (from ludwig)
  Using cached pandas-2.1.4.tar.gz (4.3 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'error'


  error: subprocess-exited-with-error
  
  pip subprocess to install build dependencies did not run successfully.
  exit code: 1
  
  [57 lines of output]
  Ignoring oldest-supported-numpy: markers 'python_version < "3.12"' don't match your environment
  Collecting meson-python==0.13.1
    Using cached meson_python-0.13.1-py3-none-any.whl.metadata (4.1 kB)
  Collecting meson==1.2.1
    Using cached meson-1.2.1-py3-none-any.whl.metadata (1.7 kB)
  Collecting wheel
    Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
  Collecting Cython<3,>=0.29.33
    Using cached Cython-0.29.37-py2.py3-none-any.whl.metadata (3.1 kB)
  Collecting numpy<2,>=1.26.0
    Using cached numpy-1.26.4.tar.gz (15.8 MB)
    Installing build dependencies: started
    Installing build dependencies: finished with status 'done'
    Getting requirements to build wheel: started
    Getting requirements to build wheel: finished with status 'done'
    Installing backend dependencies: started
    Installing bac

In [19]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("hojjatk/mnist-dataset")

print("Path to dataset files:", path)

ModuleNotFoundError: No module named 'kagglehub'

In [22]:
import os
import pandas as pd
import requests
from ludwig.api import LudwigModel
import yaml

ModuleNotFoundError: No module named 'ludwig'

In [23]:
# Configuration
DATA_DIR = "mnist_data"
TRAIN_URL = "https://pjreddie.com/media/files/mnist_train.csv"
TEST_URL = "https://pjreddie.com/media/files/mnist_test.csv"
TRAIN_FILE = os.path.join(DATA_DIR, "mnist_train.csv")
TEST_FILE = os.path.join(DATA_DIR, "mnist_test.csv")


def download_data():
    """Download MNIST CSV files if not already present."""
    os.makedirs(DATA_DIR, exist_ok=True)
    
    for url, filepath in [(TRAIN_URL, TRAIN_FILE), (TEST_URL, TEST_FILE)]:
        if not os.path.exists(filepath):
            print(f"Downloading {filepath}...")
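            response = requests.get(url, timeout=60)
            # Fail loudly on HTTP errors so an error page is never saved as CSV
            response.raise_for_status()
            with open(filepath, 'wb') as f:
                f.write(response.content)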
            print(f"Downloaded {filepath}")
        else:
            print(f"{filepath} already exists")


def prepare_data():
    """Load and prepare the MNIST data."""
    print("\nLoading data...")
    
    # Load CSV files
    # First column is the label, remaining 784 columns are pixel values (28x28)
    train_df = pd.read_csv(TRAIN_FILE, header=None)
    test_df = pd.read_csv(TEST_FILE, header=None)
    
    # Rename columns for clarity
    train_df.columns = ['label'] + [f'pixel_{i}' for i in range(784)]
    test_df.columns = ['label'] + [f'pixel_{i}' for i in range(784)]
    
    # Save preprocessed data
    train_processed = os.path.join(DATA_DIR, "mnist_train_processed.csv")
    test_processed = os.path.join(DATA_DIR, "mnist_test_processed.csv")
    
    train_df.to_csv(train_processed, index=False)
    test_df.to_csv(test_processed, index=False)
    
    print(f"Training samples: {len(train_df)}")
    print(f"Test samples: {len(test_df)}")
    print(f"Features: {len(train_df.columns) - 1}")
    print(f"Classes: {train_df['label'].nunique()}")
    
    return train_processed, test_processed


def create_ludwig_config_basic():
    """Create a basic Ludwig configuration for MNIST."""
    config = {
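        # One numeric input per pixel column written by prepare_data();
        # feature names must match the CSV column names exactly.
        # NOTE: some MNIST border pixels are constant across the dataset. If
        # zscore rejects zero-variance columns, switch to 'minmax' or drop them.
        'input_features': [
            {
                'name': f'pixel_{i}',
                'type': 'number',
                'preprocessing': {
                    'normalization': 'zscore'
                }
            } for i in range(784)
        ],
        # The hidden layers sit in the combiner, which sees all pixels at once
        'combiner': {
            'type': 'concat',
            'fc_layers': [
                {'output_size': 256, 'activation': 'relu'},
                {'output_size': 128, 'activation': 'relu'}
            ]
        },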
        'output_features': [
            {
                'name': 'label',
                'type': 'category',
                'decoder': {
                    'type': 'classifier',
                    'num_fc_layers': 2,
                    'fc_output_size': 64  # width of the decoder's FC layers
                }
            }
        ],
        'trainer': {
            'epochs': 10,
            'batch_size': 128,
            'learning_rate': 0.001,
            'early_stop': 5
        }
    }
    
    config_path = os.path.join(DATA_DIR, 'config_basic.yaml')
    with open(config_path, 'w') as f:
        yaml.dump(config, f)
    
    print(f"\nBasic configuration saved to {config_path}")
    return config


def create_ludwig_config_advanced():
    """Create an advanced Ludwig configuration with a deeper dense architecture."""
    # Each pixel is a separate number feature, so there is no spatial structure
    # for convolutions to use; a deeper dense stack with dropout is used instead
    config = {
        'input_features': [
            {
                'name': f'pixel_{i}',
                'type': 'number',
                'preprocessing': {
                    'normalization': 'minmax'  # Normalize pixels to [0, 1]
                }
            } for i in range(784)
        ],
        'combiner': {
            'type': 'concat',
            'num_fc_layers': 3,
            'output_size': 256,
            'fc_layers': [
                {'output_size': 512, 'activation': 'relu', 'dropout': 0.2},
                {'output_size': 256, 'activation': 'relu', 'dropout': 0.2},
                {'output_size': 128, 'activation': 'relu', 'dropout': 0.1}
            ]
        },
        'output_features': [
            {
                'name': 'label',
                'type': 'category',
                'decoder': {
                    'type': 'classifier',
                    'num_fc_layers': 1,
                    'fc_output_size': 64  # width of the decoder's FC layers
                },
                'loss': {
                    'type': 'softmax_cross_entropy'
                }
            }
        ],
        'trainer': {
            'epochs': 15,
            'batch_size': 256,
            'learning_rate': 0.001,
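            # Assumption: recent Ludwig releases configure plateau-based decay
            # with the keys below rather than a `type` field. Verify against
            # the trainer docs for the installed version.
            'learning_rate_scheduler': {
                'reduce_on_plateau': 3,  # max number of reductions (illustrative)
                'reduce_on_plateau_rate': 0.5,
                'reduce_on_plateau_patience': 3
            },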
            'early_stop': 5,
            'optimizer': {
                'type': 'adam'
            },
            'validation_metric': 'accuracy'
        }
    }
    
    config_path = os.path.join(DATA_DIR, 'config_advanced.yaml')
    with open(config_path, 'w') as f:
        yaml.dump(config, f)
    
    print(f"Advanced configuration saved to {config_path}")
    return config


def train_model(config, train_file, test_file, model_name="basic"):
    """Train a Ludwig model with the given configuration."""
    print(f"\n{'='*60}")
    print(f"Training {model_name.upper()} model...")
    print(f"{'='*60}")
    
    # Initialize Ludwig model
    model = LudwigModel(config=config, logging_level='INFO')
    
    # Train the model
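    # Pass the predefined splits explicitly: LudwigModel.train() takes
    # `training_set` and `test_set` rather than a `test` argument
    train_stats, preprocessed_data, output_directory = model.train(
        training_set=train_file,
        test_set=test_file,
        experiment_name=f'mnist_{model_name}',
        model_name=f'mnist_model_{model_name}'
    )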
    
    print(f"\nModel trained and saved to: {output_directory}")
    return model, train_stats, output_directory


def evaluate_model(model, test_file):
    """Evaluate the model on the test set."""
    print("\n" + "="*60)
    print("Evaluating model on test set...")
    print("="*60)
    
    # Evaluate
    eval_stats, predictions, output_directory = model.evaluate(
        dataset=test_file,
        collect_predictions=True,
        collect_overall_stats=True
    )
    
    # Print evaluation metrics
    print("\nTest Set Performance:")
    print("-" * 40)
    for metric, value in eval_stats['label'].items():
        if isinstance(value, (int, float)):
            print(f"{metric}: {value:.4f}")
    
    return eval_stats, predictions


def visualize_results(model, test_file, output_dir):
    """Generate visualizations of model performance."""
    print("\n" + "="*60)
    print("Generating visualizations...")
    print("="*60)
    
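    try:
        import json
        from ludwig.visualize import learning_curves, confusion_matrix
        
        # Learning curves are plotted from training statistics, not from a
        # directory path. This assumes the run wrote training_statistics.json
        # into output_dir.
        print("Creating learning curves...")
        with open(os.path.join(output_dir, 'training_statistics.json')) as f:
            train_stats = json.load(f)
        learning_curves([train_stats], output_feature_name='label',
                        output_directory=output_dir, file_format='png')
        
        # The confusion matrix needs evaluation statistics plus the
        # training-set metadata that maps class indices back to labels.
        print("Creating confusion matrix...")
        eval_stats, _, _ = model.evaluate(dataset=test_file,
                                          collect_overall_stats=True)
        confusion_matrix([eval_stats], model.training_set_metadata, 'label',
                         top_n_classes=[10], normalize=True,
                         output_directory=output_dir, file_format='png')
        
        print(f"\nVisualizations saved to: {output_dir}")
    except Exception as e:
        print(f"Visualization error (optional): {e}")
        print("Install matplotlib for visualizations: pip install matplotlib")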


def main():
    """Main execution function."""
    print("="*60)
    print("MNIST Classification with Ludwig")
    print("="*60)
    
    # Step 1: Download data
    print("\nStep 1: Downloading data...")
    download_data()
    
    # Step 2: Prepare data
    print("\nStep 2: Preparing data...")
    train_file, test_file = prepare_data()
    
    # Step 3: Choose configuration
    print("\nStep 3: Creating model configuration...")
    print("\nChoose model configuration:")
    print("1. Basic (faster, simpler architecture)")
    print("2. Advanced (better performance, more complex)")
    
    choice = input("\nEnter choice (1 or 2, default=1): ").strip() or "1"
    
    if choice == "2":
        config = create_ludwig_config_advanced()
        model_name = "advanced"
    else:
        config = create_ludwig_config_basic()
        model_name = "basic"
    
    # Step 4: Train model
    model, train_stats, output_dir = train_model(
        config, train_file, test_file, model_name
    )
    
    # Step 5: Evaluate model
    eval_stats, predictions = evaluate_model(model, test_file)
    
    # Step 6: Visualize results
    visualize_results(model, test_file, output_dir)
    
    print("\n" + "="*60)
    print("Training and evaluation complete!")
    print("="*60)
    print(f"\nModel saved to: {output_dir}")
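    # Ludwig stores the trained model in the `model` subdirectory of the run
    model_dir = os.path.join(output_dir, 'model')
    print("You can load this model later using:")
    print("  from ludwig.api import LudwigModel")
    print(f"  model = LudwigModel.load('{model_dir}')")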


if __name__ == "__main__":
    main()

MNIST Classification with Ludwig

Step 1: Downloading data...
Downloading mnist_data\mnist_train.csv...
Downloaded mnist_data\mnist_train.csv
Downloading mnist_data\mnist_test.csv...
Downloaded mnist_data\mnist_test.csv

Step 2: Preparing data...

Loading data...


ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 2
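
A `ParserError` like this is consistent with the download having saved something other than the expected CSV (for example an HTML error page), although the output alone does not confirm that. One way to check, sketched below, is to inspect the first few rows before handing the file to pandas. `check_mnist_csv` is a hypothetical helper, and the temporary files exist only to demonstrate it.

```python
import csv
import os
import tempfile


def check_mnist_csv(path, expected_fields=785, max_rows=5):
    """Return a list of problems found in the first few rows of `path`."""
    problems = []
    rows_seen = 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        for line_number, row in enumerate(csv.reader(f), start=1):
            if line_number > max_rows:
                break
            rows_seen += 1
            if len(row) != expected_fields:
                problems.append(
                    f"line {line_number}: {len(row)} fields, expected {expected_fields}"
                )
            elif not all(cell.strip().isdigit() for cell in row):
                problems.append(f"line {line_number}: non-integer value")
    if rows_seen == 0:
        problems.append("file is empty")
    return problems


# Demonstration on two throwaway files: one valid row, one HTML page.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w") as f:
        f.write(",".join(["7"] + ["0"] * 784) + "\n")
    with open(bad, "w") as f:
        f.write("<!DOCTYPE html>\n<html><head><title>Not Found</title></head></html>\n")
    good_problems = check_mnist_csv(good)
    bad_problems = check_mnist_csv(bad)

print("good file:", good_problems)
print("bad file:", bad_problems)
```

In this notebook the same check could be run on `TRAIN_FILE` and `TEST_FILE` right after `download_data()`. If it reports problems, delete the files in `mnist_data` so the next run downloads them again.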


The code is running but I was not able to install ludwig after so many