# OCR Pipeline Runner and App Launcher

This notebook lets you:

1. **Run the training pipeline** to train the Decision Tree and Random Forest models on EMNIST letters.
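2. **Save trained models as artifacts** under the top-level `artifacts` directory (plus `artifacts_pixels`, `artifacts_hog`, and `artifacts_pca` for the comparison runs).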
3. **Launch the Streamlit app** that uses those saved models for interactive exploration.

> Tip: Run the cells from top to bottom. Install the dependencies listed in `requirements.txt` first.


In [1]:
# Ensure the project root is on sys.path so imports work when running this notebook
import sys
from pathlib import Path

# Assumes the kernel's working directory is `notebooks/`, so the project root is its parent directory.
PROJECT_ROOT = Path.cwd().resolve().parent
if (PROJECT_ROOT / "src").exists() and str(PROJECT_ROOT / "src") not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT / "src"))

print("Project root:", PROJECT_ROOT)
print("Python path updated with:", PROJECT_ROOT / "src")


Project root: G:\University\Third Year Term 1\AI\Letter_OCR_Project
Python path updated with: G:\University\Third Year Term 1\AI\Letter_OCR_Project\src
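
The cell above assumes the notebook is started from `notebooks/`. If the kernel starts somewhere else (for example at the project root), `Path.cwd().parent` points at the wrong directory. A minimal sketch of a more tolerant lookup follows; `find_project_root` and `add_src_to_path` are hypothetical helpers, not part of `ocr_project`, and the choice of `src` as the marker directory is an assumption taken from the cell above.

```python
import sys
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker directory


def add_src_to_path(root: Path, marker: str = "src") -> bool:
    """Prepend `root/marker` to sys.path once; return True if it was added."""
    src = str(root / marker)
    if src in sys.path:
        return False
    sys.path.insert(0, src)
    return True


# Possible usage in the notebook (hypothetical):
# root = find_project_root(Path.cwd())
# if root is None:
#     raise RuntimeError("Could not locate a directory containing 'src'")
# add_src_to_path(root)
```

Walking up the parents means the same cell works whether the kernel starts in `notebooks/` or at the project root.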


In [2]:
# Optional: verify config paths and that data directories exist
from ocr_project import config

print("EMNIST train:", config.EMNIST_LETTERS_TRAIN)
print("EMNIST test:", config.EMNIST_LETTERS_TEST)
print("Data dir:", config.DATA_DIR)
print("Processed dir:", config.PROCESSED_DATA_DIR)

# Ensure the key directories exist
config.ensure_directories()


EMNIST train: G:\University\Third Year Term 1\AI\Letter_OCR_Project\data\raw\emnist-letters-train.csv
EMNIST test: G:\University\Third Year Term 1\AI\Letter_OCR_Project\data\raw\emnist-letters-test.csv
Data dir: G:\University\Third Year Term 1\AI\Letter_OCR_Project\data
Processed dir: G:\University\Third Year Term 1\AI\Letter_OCR_Project\data\processed
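
Printing the configured paths does not confirm that the EMNIST CSV files are actually present. A small sketch of an explicit check, run before training, might look like this; `missing_files` is a hypothetical helper, and only the `config.EMNIST_LETTERS_TRAIN` and `config.EMNIST_LETTERS_TEST` names in the usage comment come from the cell above.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[Path]) -> List[Path]:
    """Return the subset of `paths` that do not exist as regular files."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Possible usage in the notebook (hypothetical):
# missing = missing_files([config.EMNIST_LETTERS_TRAIN, config.EMNIST_LETTERS_TEST])
# if missing:
#     raise FileNotFoundError(f"Missing EMNIST files: {[str(p) for p in missing]}")
```

Failing here gives a clearer error than a read failure partway through the training cell.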


In [3]:
# Train four configurations: raw pixels, HOG, PCA, and the default HOG+PCA pipeline.
# The first three save their models into separate folders (`artifacts_pixels`,
# `artifacts_hog`, `artifacts_pca`) so the app can compare all combinations;
# the default HOG+PCA pipeline saves to `artifacts`.
#
# NOTE: The pipeline runs 5-fold cross-validation on the training data and
# evaluates on a separate holdout test set.
from ocr_project.pipeline import OCRPipeline, run_default_pipeline
from ocr_project.features.hog_features import HOGFeatureExtractor
from ocr_project.features.pca_features import PCAFeatures
from ocr_project import config

def print_results(results):
    """Print CV and holdout accuracy for each model in a results dict."""
    for name, result in results.items():
        if result.accuracy_std > 0:
            print(f"- {name}: CV = {result.accuracy:.4f} (±{result.accuracy_std:.4f}), Test = {result.test_accuracy:.4f}")
        else:
            # accuracy_std is zero, so report the holdout accuracy only
            print(f"- {name}: Test = {result.test_accuracy:.4f}")


all_results = {}

# 1) Pixels only (baseline)
print("=== Pixels only (baseline) ===")
artifacts_pixels = config.PROJECT_ROOT / "artifacts_pixels"
pipeline_pixels = OCRPipeline(artifacts_dir=artifacts_pixels)
results_pixels = pipeline_pixels.run()
print_results(results_pixels)
all_results["pixels"] = results_pixels

# 2) HOG features
print("\n=== HOG features ===")
artifacts_hog = config.PROJECT_ROOT / "artifacts_hog"
pipeline_hog = OCRPipeline(
    feature_extractor=HOGFeatureExtractor(),
    artifacts_dir=artifacts_hog,
)
results_hog = pipeline_hog.run()
print_results(results_hog)
all_results["hog"] = results_hog

# 3) PCA features
print("\n=== PCA features ===")
artifacts_pca = config.PROJECT_ROOT / "artifacts_pca"
pipeline_pca = OCRPipeline(
    feature_extractor=PCAFeatures(n_components=50),
    artifacts_dir=artifacts_pca,
)
results_pca = pipeline_pca.run()
print_results(results_pca)
all_results["pca"] = results_pca

# 4) Strong default pipeline (HOG+PCA + powerful Random Forest) → saves to `artifacts`
print("\n=== Default HOG+PCA pipeline (artifacts/) ===")
default_results = run_default_pipeline()
print_results(default_results)
all_results["default_hog_pca"] = default_results

print("\nTraining complete for all configurations.")
print("CV = Cross-Validation mean accuracy on training data")
print("Test = Accuracy on holdout test set (not used during training)")

=== Pixels only (baseline) ===
Evaluating models using: Cross-Validation (on training data)
Number of CV folds: 5
Test data is kept as a separate holdout set.

Training Decision Tree...
  CV Mean Accuracy: 0.6817 (+/- 0.0037)
  Test Set Accuracy: 0.6818

Training Random Forest...
  CV Mean Accuracy: 0.8716 (+/- 0.0027)
  Test Set Accuracy: 0.8621
- decision_tree: CV = 0.6817 (±0.0037), Test = 0.6818
- random_forest: CV = 0.8716 (±0.0027), Test = 0.8621

=== HOG features ===
Evaluating models using: Cross-Validation (on training data)
Number of CV folds: 5
Test data is kept as a separate holdout set.

Training Decision Tree...
  CV Mean Accuracy: 0.6490 (+/- 0.0044)
  Test Set Accuracy: 0.6286

Training Random Forest...
  CV Mean Accuracy: 0.8586 (+/- 0.0015)
  Test Set Accuracy: 0.8426
- decision_tree: CV = 0.6490 (±0.0044), Test = 0.6286
- random_forest: CV = 0.8586 (±0.0015), Test = 0.8426

=== PCA features ===
Evaluating models using: Cross-Validation (on training data)
Number of CV
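
The per-run printouts are hard to compare at a glance. As a rough sketch, the results can be flattened into one table sorted by holdout accuracy, with the gap between CV and test accuracy alongside. The `Result` dataclass is a hypothetical stand-in that only mirrors the three attributes used in the training cell (`accuracy`, `accuracy_std`, `test_accuracy`); the numbers are copied from the pixel and HOG runs logged above.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for the pipeline's result objects."""
    accuracy: float        # mean cross-validation accuracy
    accuracy_std: float    # standard deviation across CV folds
    test_accuracy: float   # accuracy on the holdout test set


# Values copied from the pixel and HOG runs printed above.
logged = {
    "pixels": {
        "decision_tree": Result(0.6817, 0.0037, 0.6818),
        "random_forest": Result(0.8716, 0.0027, 0.8621),
    },
    "hog": {
        "decision_tree": Result(0.6490, 0.0044, 0.6286),
        "random_forest": Result(0.8586, 0.0015, 0.8426),
    },
}


def summarize(all_results):
    """Flatten nested results into rows sorted by holdout accuracy (best first)."""
    rows = []
    for config_name, results in all_results.items():
        for model_name, r in results.items():
            gap = r.accuracy - r.test_accuracy  # positive: CV is more optimistic than holdout
            rows.append((config_name, model_name, r.accuracy, r.test_accuracy, gap))
    return sorted(rows, key=lambda row: row[3], reverse=True)


print(f"{'features':<10}{'model':<16}{'CV':>8}{'Test':>8}{'Gap':>9}")
for config_name, model_name, cv, test, gap in summarize(logged):
    print(f"{config_name:<10}{model_name:<16}{cv:>8.4f}{test:>8.4f}{gap:>+9.4f}")
```

In the notebook itself, `summarize(all_results)` should work directly, provided the real result objects expose the same three attributes.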

In [4]:
# Inspect saved artifacts used by the Streamlit app
from pathlib import Path
from ocr_project import config

# The default pipeline saves into a top-level "artifacts" directory so that
# app.py can load models from there directly.
artifacts_dir = config.PROJECT_ROOT / "artifacts"
print("Artifacts saved under:", artifacts_dir)

for p in sorted(artifacts_dir.glob("*.pkl")):
    print("-", p.name)


Artifacts saved under: G:\University\Third Year Term 1\AI\Letter_OCR_Project\artifacts
- decision_tree.pkl
- random_forest.pkl
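
To confirm that the saved files can be read back, they can be loaded into a dictionary keyed by file name. This is a sketch under the assumption that the artifacts were written with the standard `pickle` module; the notebook does not show how they were saved, and if `joblib` was used instead, `joblib.load(path)` would replace the `pickle.load` call. `load_artifacts` is a hypothetical helper.

```python
import pickle
from pathlib import Path
from typing import Any, Dict


def load_artifacts(artifacts_dir: Path) -> Dict[str, Any]:
    """Load every `*.pkl` file in `artifacts_dir`, keyed by file stem."""
    models: Dict[str, Any] = {}
    for path in sorted(Path(artifacts_dir).glob("*.pkl")):
        with path.open("rb") as fh:
            # Only unpickle files you created yourself: pickle can execute code on load.
            models[path.stem] = pickle.load(fh)
    return models


# Possible usage in the notebook (hypothetical):
# models = load_artifacts(config.PROJECT_ROOT / "artifacts")
# print(sorted(models))
```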


## Launch the Streamlit app

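The following cell launches the Streamlit app defined in `app.py`. It is commented out by default; uncomment it to run.

- **In Jupyter / VS Code / Cursor:** the cell starts Streamlit in a separate process and waits for that process to exit before printing its captured output.
- Open `http://localhost:8501` (Streamlit's default address) in your browser while the cell is running.

Stop the app by interrupting the kernel, or with **Ctrl+C** if you launched it from a terminal instead.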


In [5]:
# import sys
# import subprocess
# from pathlib import Path
# 
# # Launch the Streamlit app from this notebook
# # We run it with cwd set to the project root so that app.py and artifacts are found.
# from ocr_project import config
# project_root = config.PROJECT_ROOT
# 
# cmd = [sys.executable, "-m", "streamlit", "run", "app.py"]
# print("Running:", " ".join(cmd))
# print("Working directory:", project_root)
# print("If nothing happens, open http://localhost:8501 manually in your browser.")
# 
# # Run Streamlit and capture its output so we can see why it exits
# result = subprocess.run(cmd, cwd=str(project_root), capture_output=True, text=True)
# 
# print("Return code:", result.returncode)
# print("\n--- STDOUT ---\n")
# print(result.stdout)
# print("\n--- STDERR ---\n")
# print(result.stderr)
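
Because `subprocess.run` blocks until Streamlit exits, the cell above keeps the kernel busy for as long as the app is running. One possible alternative, sketched here, is to start the app with `subprocess.Popen` and stop it explicitly later. `streamlit_command`, `start_app`, and `stop_app` are hypothetical helper names; `--server.port` and `--server.headless` are Streamlit's own options.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def streamlit_command(app_file: str = "app.py", port: int = 8501) -> List[str]:
    """Build the command line that runs `app_file` with the current interpreter."""
    return [
        sys.executable, "-m", "streamlit", "run", app_file,
        "--server.port", str(port),
        "--server.headless", "true",  # do not try to open a browser window
    ]


def start_app(project_root: Path, port: int = 8501) -> subprocess.Popen:
    """Start Streamlit in the background and return the process handle."""
    return subprocess.Popen(streamlit_command(port=port), cwd=str(project_root))


def stop_app(process: subprocess.Popen, timeout: float = 5.0) -> None:
    """Ask the process to exit; kill it if it does not stop within `timeout` seconds."""
    if process.poll() is None:
        process.terminate()
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()


# Possible usage in the notebook (hypothetical):
# app = start_app(config.PROJECT_ROOT)
# ... open http://localhost:8501 in a browser ...
# stop_app(app)
```

The trade-off is that Streamlit's output is no longer captured, so startup errors appear wherever the kernel's own output goes rather than in the cell.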


In [6]:
# import sys, subprocess
# 
# print("Notebook Python:", sys.executable)
# subprocess.run([sys.executable, "-m", "pip", "install", "streamlit"])
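
Before reinstalling, it can help to check whether the kernel's interpreter already sees Streamlit. A minimal sketch using the standard library follows; `is_installed` is a hypothetical helper and is only meant for top-level module names.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Possible usage in the notebook (hypothetical):
# if not is_installed("streamlit"):
#     subprocess.run([sys.executable, "-m", "pip", "install", "streamlit"], check=True)
```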