# PaddleOCR Training Launcher for Kaggle

This notebook sets up and launches PaddleOCR training on Kaggle GPUs.

## Instructions:
1. Upload your training data to Kaggle Datasets as `paddleocr-training-data`.
2. Add the GitHub repo `Gl4d3/paddleocr-train` as a Kaggle Dataset named `github-paddleocr-training`.
3. Run this notebook in a Kaggle environment with a GPU accelerator enabled.

## 1. Check GPU Availability

In [None]:
!nvidia-smi
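
If you would rather have the notebook report a missing GPU than read the `nvidia-smi` table by eye, the check can be scripted. This is a minimal sketch using only the standard library; the helper name `gpu_visible` is hypothetical and not part of the repository.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.returncode == 0 and "GPU" in result.stdout


if not gpu_visible():
    print("No GPU detected: enable a GPU accelerator in the notebook settings.")
```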

## 2. Install Required Packages

In [None]:
!pip install -q paddlepaddle-gpu==2.4.2
!pip install -q mlflow paddleocr visualdl opencv-python lmdb imgaug pyclipper scikit-image

## 3. Set Up MLflow Tracking

In [None]:
import os
import mlflow
from mlflow.tracking import MlflowClient

# Set up MLflow locally (you can change this to a remote server if needed)
os.makedirs('/kaggle/working/mlruns', exist_ok=True)
mlflow.set_tracking_uri('file:///kaggle/working/mlruns')
os.environ['MLFLOW_TRACKING_URI'] = 'file:///kaggle/working/mlruns'

# Create the experiment
experiment_name = 'paddleocr_training'
mlflow.set_experiment(experiment_name)

print(f"MLflow tracking at: {mlflow.get_tracking_uri()}")
print(f"MLflow experiment: {experiment_name}")

## 4. Get PaddleOCR Code from GitHub Repository

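This step uses the `github-paddleocr-training` dataset if it is attached to the notebook; otherwise it clones the repository directly from GitHub (cloning requires Internet access to be enabled in the notebook settings).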

In [None]:
# Check if dataset exists
if os.path.exists('/kaggle/input/github-paddleocr-training'):
    print("Found GitHub repository dataset")
    !ls -la /kaggle/input/github-paddleocr-training
else:
    print("GitHub repository dataset not found, cloning directly...")
    !git clone https://github.com/Gl4d3/paddleocr-train.git /kaggle/working/paddleocr

In [None]:
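# Copy the repository from the dataset, unless it was already cloned in the previous cell
if os.path.exists('/kaggle/input/github-paddleocr-training') and not os.path.exists('/kaggle/working/paddleocr'):
    !cp -r /kaggle/input/github-paddleocr-training /kaggle/working/paddleocr

# Check structure
!ls -la /kaggle/working/paddleocr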

## 5. Check Training Data

Make sure your training data is properly uploaded to Kaggle.

In [None]:
# Check if training data exists
if os.path.exists('/kaggle/input/paddleocr-training-data'):
    print("Found training data dataset")
    !ls -la /kaggle/input/paddleocr-training-data
else:
    print("Training data dataset not found! Please upload your training data to Kaggle Datasets.")
    print("This notebook will continue but training may fail without data.")
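
The existence check above can be extended to also catch a dataset that is attached but empty. The following is a sketch under that assumption; `missing_inputs` is a hypothetical helper, and the only path it checks is the one already used in this notebook.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the paths that do not exist or are empty directories."""
    missing = []
    for p in map(Path, paths):
        if not p.exists() or (p.is_dir() and not any(p.iterdir())):
            missing.append(str(p))
    return missing


problems = missing_inputs(["/kaggle/input/paddleocr-training-data"])
if problems:
    print("Missing or empty inputs:", ", ".join(problems))
```

Stopping here when `problems` is non-empty is cheaper than discovering the missing data only after the training run has started.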

## 6. Copy Training Script to Working Directory

In [None]:
!cp /kaggle/working/paddleocr/model_training/notebooks/kaggle_training.py /kaggle/working/

## 7. Run Training

You can customize the training by adjusting the parameters below.

In [None]:
# Set training parameters
det_dataset_dir = '/kaggle/working/dataset/det_dataset_1'
rec_dataset_dir = '/kaggle/working/dataset/rec_dataset_1'
train_data_dir = '/kaggle/working/train_data/meter_detection'
max_det_epochs = 200  # Reduce for testing, increase for production
max_rec_epochs = 300  # Reduce for testing, increase for production

# Build command
cmd = f"""python /kaggle/working/kaggle_training.py \
    --exp_name='paddleocr_training' \
    --tracking_uri='file:///kaggle/working/mlruns' \
    --det_dataset_dir='{det_dataset_dir}' \
    --rec_dataset_dir='{rec_dataset_dir}' \
    --train_data_dir='{train_data_dir}' \
    --gpu_ids='0' \
    --max_det_epochs={max_det_epochs} \
    --max_rec_epochs={max_rec_epochs} \
    --det_batch_size=8 \
    --rec_batch_size=64
"""

print(f"Training command:\n{cmd}")
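
As an alternative to a single formatted string, the same command can be assembled as an argument list, which avoids shell quoting problems if a path ever contains spaces. This is only a sketch: `build_training_args` is a hypothetical helper, and the flag names are the ones used in the cell above, not a verified list of what `kaggle_training.py` accepts.

```python
import shlex


def build_training_args(script, **options):
    """Build an argv list of the form [python, script, --key=value, ...]."""
    args = ["python", script]
    args += [f"--{key}={value}" for key, value in options.items()]
    return args


args = build_training_args(
    "/kaggle/working/kaggle_training.py",
    exp_name="paddleocr_training",
    tracking_uri="file:///kaggle/working/mlruns",
    det_dataset_dir="/kaggle/working/dataset/det_dataset_1",
    rec_dataset_dir="/kaggle/working/dataset/rec_dataset_1",
    train_data_dir="/kaggle/working/train_data/meter_detection",
    gpu_ids="0",
    max_det_epochs=200,
    max_rec_epochs=300,
    det_batch_size=8,
    rec_batch_size=64,
)

# shlex.join quotes each argument so the printed line is safe to paste into a shell.
print(shlex.join(args))
```

The list can then be passed to `subprocess.run(args, check=True)`, which raises an error if the training script exits with a non-zero status.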

In [None]:
# Run training
!$cmd

## 8. View Training Results and Metrics

In [None]:
# List saved models
!ls -la /kaggle/working/trained_models.zip

In [None]:
# Display MLflow results
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("paddleocr_training")
# get_experiment_by_name returns None if the experiment does not exist
runs = client.search_runs(experiment_ids=[experiment.experiment_id]) if experiment else []

for run in runs:
    print(f"Run ID: {run.info.run_id}")
    print(f"Status: {run.info.status}")
    print("Parameters:")
    for k, v in run.data.params.items():
        print(f"  {k}: {v}")
    print("Metrics:")
    for k, v in run.data.metrics.items():
        print(f"  {k}: {v}")
    print("====================================")

## 9. Package Results for Download

The trained models are already packaged into `trained_models.zip`, which can be downloaded from Kaggle.

In [None]:
# Create MLflow artifacts archive
# (zip is run from /kaggle/working so the archive stores relative paths)
!mkdir -p /kaggle/working/mlflow_artifacts
!cp -r /kaggle/working/mlruns /kaggle/working/mlflow_artifacts/
!cd /kaggle/working && zip -r -q mlflow_artifacts.zip mlflow_artifacts

print("Training artifacts ready for download:")
print(" - /kaggle/working/trained_models.zip - Trained models")
print(" - /kaggle/working/mlflow_artifacts.zip - MLflow logs and metrics")
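
The packaging step can also be done in Python, without the intermediate copy, using `shutil.make_archive`. This is a minimal sketch; `archive_dir` is a hypothetical helper, and the paths are the ones used in this notebook.

```python
import shutil
from pathlib import Path


def archive_dir(source_dir, archive_stem):
    """Zip source_dir into '<archive_stem>.zip' and return the archive path.

    Entries are stored relative to the parent of source_dir, so the
    archive unpacks into a single top-level folder.
    """
    source = Path(source_dir)
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=str(source.parent),
        base_dir=source.name,
    )


mlruns = Path("/kaggle/working/mlruns")
if mlruns.is_dir():
    print(archive_dir(mlruns, "/kaggle/working/mlflow_artifacts"))
```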