# Table of Contents

## 1. Introduction
   - 1.1. Overview of the EMNIST Dataset
     - Description and Importance
     - Structure and Dataset Variants
   - 1.2. Initial Exploratory Data Analysis (EDA)
     - 1.2.1. Essential Dataset Insights
       - Balanced Digit Distribution  
       - Visualization of Random Samples
       - Image Analysis: Pixel Density, Centre of Mass, Non-zero Proportions
     - 1.2.2. Variance and Class-Level Pixel Statistics  
       - Feature Variance Across Digits  
       - Mean and Variance Images per Digit  
       - Visualizing Differences in Average Images

# Part I: Supervised Classification

## 2. Training and Evaluation of Baseline Models
   - 2.1. Helper Functionality for Baselining
   - 2.2. Baseline Models
      - 2.2.1. Logistic Regression
      - 2.2.2. Support Vector Machine (SVM)
      - 2.2.3. Decision Tree
      - 2.2.4. Ensemble Methods
         - 2.2.4.1. Random Forest
         - 2.2.4.2. Extra Trees (Extremely Randomized Trees)
         - 2.2.4.3. XGBoost
         - 2.2.4.4. AdaBoost
   - 2.3. Performance Evaluation: A Risk Measurement Perspective
      - 2.3.1. Helper Functionality for Modular Evaluation and Summarization
      - 2.3.2. An Initial Overview of Baseline Model Performance
      - 2.3.3. Analysis of Precision-Recall Curve
      - 2.3.4. Looking into Precision-Recall-Threshold
      - 2.3.5. Understanding ROC Curve
      - 2.3.6. F1-Score vs Threshold
      - 2.3.7. Comparative Analysis (Merged with 2.3.1)
   - 2.4. Reliability Analysis: Brier Scores
   - 2.5. Error Analysis
      - 2.5.1. Confusion Matrices Analysis: One-vs-All
      - 2.5.2. Confusion Heatmaps Analysis: One-vs-One
         - 2.5.2.1. True Negatives: The Models' Strengths
         - 2.5.2.2. False Negatives: The Toughest Challenges
         - 2.5.2.3. False Positives: The Confusing Lookalikes
      - 2.5.3. Visualization of Misclassified Images
         - 2.5.3.1. Images Misclassified by the Linear Models
         - 2.5.3.2. Images Misclassified by the Tree-based Models
         - 2.5.3.3. Images Misclassified by the Boosting Models
         - 2.5.3.4. Poor Images Misclassified by Every Model
   - 2.6. Strategic Insights and Forward Planning

## 3. From Baseline to Brilliance: Classifiers Optimization
   - 3.1. Role of Hyperparameter Tuning and Cross Validation
      - 3.1.1. Machine Learning As Function Approximation
      - 3.1.2. Learning As Optimization Over Function Space
      - 3.1.3. Hyperparameter Tuning As Searching Strategy
      - 3.1.4. Quantifying Generalizability Using Cross Validation
         - 3.1.4.1. The Bias-Variance Trade-off
         - 3.1.4.2. Types of Cross Validation
   - 3.2. Hyperparameter Tuning Strategies: the Exploration vs Exploitation Paradigm
      - 3.2.1. Helper Functionality for Modularization
      - 3.2.2. HalvingGridSearchCV
      - 3.2.3. Specs, Training, Evaluation, and Quantification of Tuned Models Against the Baselines
      - 3.2.4. Remarks and Insights for Further Tuning
   - 3.3. Performance and Timing Analysis For Tuned Models
      - 3.3.1. Overview of the Results
      - 3.3.2. Precision-Recall: A Pareto Efficiency Perspective
          - 3.3.2.1. Key Observations
          - 3.3.2.2. Cost-benefit Analysis
          - 3.3.2.3. From Theory to Business
      - 3.3.3. Precision-Recall-Threshold Curve
      - 3.3.4. ROC Curve: Trade-off Between Sensitivity and Selectivity
      - 3.3.5. F1-score vs Threshold
   - 3.4. Error Analysis
      - 3.4.1. Confusion Matrix Analysis: The Binary Case of 8-vs-all
      - 3.4.2. Class-wise Confusion Matrix Analysis
          - 3.4.2.1. False Positive Analysis
          - 3.4.2.2. True Negative Analysis
      - 3.4.3. Common Errors and Insights: Identifying, Visualizing, and Analyzing
      - 3.4.4. Reliability Analysis
         - 3.4.4.1. Brier Scores
         - 3.4.4.2. Calibration Curves

## 4. The Quest for the Best Classifier
   - 4.1. Voting Ensembles: An Overview
   - 4.2. Training Hard and Soft Voting Models
      - 4.2.1. Evaluation of the Voting Models
   - 4.3. Training A Stack Ensemble
      - 4.3.1. Meta-model Selection and Training
      - 4.3.2. Evaluation of the Stack Model
   - 4.4. Quantitative Comparison of the Voting Ensembles and the Stack Model
   - 4.5. Error Distribution Analysis
      - 4.5.1. Misclassified Frequencies
      - 4.5.2. Heatmap Representation of Misclassified Frequencies
      - 4.5.3. Analysis of Proportional Misclassifications
   - 4.6. Model Generalizability Analysis

## 5. Feature Importance and Feature Selection
   - 5.1. Introduction to Feature Importance
      - Techniques Overview (SHAP, Random Forest Importance, Coefficients in Linear Models)
      - 5.1.1. Strengths and Weaknesses of Different Methods
   - 5.2. Measuring Feature Importance
      - 5.2.1. SHAP Values
        - 5.2.1.1. Class-wise SHAP Analysis
        - 5.2.1.2. Consistently Top Features
      - 5.2.2. Random Forest Importance
        - 5.2.2.1. Static Feature Importance
        - 5.2.2.2. Dynamic Feature Importance: RFE
   - 5.3. Feature Selection
      - 5.3.1. Selecting Feature Sets Based on Different Techniques
      - 5.3.2. Visually Comparing Different Feature Selections
      - 5.3.3. Benchmarking Different Techniques: Retraining and Evaluation
   - 5.4. Understanding Error Through Feature Importance
      - 5.4.1. A Future Endeavour: Trustworthiness, Robustness, And Adversarial Training

## 6. Training Multiclass Classifiers
   - 6.1. Ways of Training a Multiclass Classifier
      - 6.1.1. Comparative Advantages and Disadvantages
   - 6.2. Training Direct Multiclass Classifiers
      - 6.2.1. Feature Selection Pipeline
      - 6.2.2. Modular Evaluation Functionality
      - 6.2.3. Naming Convention
      - 6.2.4. XGBoost
      - 6.2.5. AdaBoost
      - 6.2.6. LightGBM
      - 6.2.7. Benchmarking the Direct Models
         - 6.2.7.1. Analyzing Precision And Recall For All Models and Digits
         - 6.2.7.2. Discussion on the AdaBoost Multiclass Classifier's Underperformance
   - 6.3. Training and Ensembling An AdaBoost Multiclass Classifier
      - 6.3.1. Systematic Training of Multiple One-vs-All AdaBoost Models
      - 6.3.2. A Soft Voting AdaBoost Ensemble
      - 6.3.3. Benchmarking The Multiclass Soft Voting Ensemble
         - 6.3.3.1. The One-vs-All Ensemble Classifier vs The Multiclass AdaBoost Classifier
   - 6.4. Why Did the One-vs-All Ensemble Win?
   - 6.5. Benchmarking The Soft Voting Ensemble vs XGB and LightGBM Multiclass Classifiers
   - 6.6. Error Analysis and Visualizations
   - 6.7. Quantifying Model Generalizability and Uncertainty
      - 6.7.1. Training, Validation, And Test Curves
      - 6.7.2. Measuring Uncertainty
         - 6.7.2.1. Predictive Entropy
         - 6.7.2.2. Quantifiably Distinguishing Between Overconfidence And Uncertainty

## 7. Conclusion
   - 7.1. Limitations and Future Work