# PaddleOCR and Amazon SageMaker
This notebook explains the steps needed to train a PaddleOCR text recognition model locally and deploy it to an Amazon SageMaker endpoint.
## Installation and configuration

In [None]:
!git clone https://github.com/PaddlePaddle/PaddleOCR

In [None]:
# If you have CUDA 9 or CUDA 10
!pip install paddlepaddle-gpu -i https://pypi.tuna.tsinghua.edu.cn/simple
!pip install "paddleocr>=2.0.1" -q
# If running without GPU
# !pip install paddlepaddle -i https://pypi.tuna.tsinghua.edu.cn/simple

In [None]:
!pip install -r "PaddleOCR/requirements.txt"

You might need to install openssl11-libs if you get an error when importing paddleocr.

In [None]:
#!sudo yum install openssl11-libs -y

## 1. Use PaddleOCR locally (generic model)

In [None]:
from paddleocr import PaddleOCR, draw_ocr

In [None]:
ocr = PaddleOCR(use_angle_cls=True, lang='en')

In [None]:
img_path = 'test_images/1.png'
result = ocr.ocr(img_path, cls=True)
# result holds one entry per image; each entry is a list of detected text lines
for res in result:
    for line in res:
        print(line)

In [None]:
from PIL import Image
result = result[0]
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='PaddleOCR/doc/fonts/simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
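
The indexing in the cell above (`line[0]` for the box, `line[1][0]` for the text, `line[1][1]` for the score) implies a nested list structure. A minimal sketch of turning one page of that output into plain dictionaries is shown below. The helper name `flatten_ocr_result`, the `min_score` parameter, and the sample data are hypothetical and only illustrate the shape; they are not part of PaddleOCR.

```python
from typing import Any, Dict, List


def flatten_ocr_result(page: List[Any], min_score: float = 0.0) -> List[Dict[str, Any]]:
    """Convert one page of OCR output into a list of dictionaries."""
    records = []
    # Guard against an empty page (no text detected)
    for box, (text, score) in page or []:
        if score >= min_score:
            records.append({"text": text, "score": float(score), "box": box})
    return records


# Hypothetical sample shaped like one page of `ocr.ocr(...)` output
sample_page = [
    [[[10, 10], [120, 10], [120, 40], [10, 40]], ("HELLO", 0.98)],
    [[[10, 50], [90, 50], [90, 80], [10, 80]], ("w0rld", 0.42)],
]

# Keep only lines recognized with a score of at least 0.5
for record in flatten_ocr_result(sample_page, min_score=0.5):
    print(record["text"], record["score"])
```

With real output, `flatten_ocr_result(result)` would be called on the per-image list obtained from `result[0]`.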

## 2. PaddleOCR on SageMaker Endpoints

In [None]:
!pip install sagemaker -qU

In [None]:
import boto3
import sagemaker
import json
from sagemaker import get_execution_role, Session
from sagemaker.pytorch import PyTorchModel
from sagemaker.serializers import DataSerializer
from sagemaker.deserializers import JSONDeserializer
role = get_execution_role()

### 2.1 Download the models

In [None]:
!mkdir -p model/det
!mkdir -p model/rec/en
!mkdir -p model/dict
!mkdir -p model/cls

!wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/en_ppocr_mobile_v2.0_det_infer.tar -O model/det/en_ppocr_mobile_v2.0_det_infer.tar
!wget https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_infer.tar -O model/rec/en/en_PP-OCRv3_rec_infer.tar
!wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar -O model/cls/ch_ppocr_mobile_v2.0_cls_infer.tar

!cd model/det/ && tar xvf en_ppocr_mobile_v2.0_det_infer.tar --strip-components 1 && rm en_ppocr_mobile_v2.0_det_infer.tar
!cd model/rec/en/ && tar xvf en_PP-OCRv3_rec_infer.tar --strip-components 1 && rm en_PP-OCRv3_rec_infer.tar
!cd model/cls/ && tar xvf ch_ppocr_mobile_v2.0_cls_infer.tar --strip-components 1 && rm ch_ppocr_mobile_v2.0_cls_infer.tar
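
After extraction, it can be worth confirming that each folder holds the expected files before packaging. The sketch below assumes each archive unpacks to `inference.pdmodel` and `inference.pdiparams`, which is the usual layout of a Paddle inference export; the helper name `find_missing_model_files` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# Assumption: file names a Paddle inference export typically contains
EXPECTED_FILES = ("inference.pdmodel", "inference.pdiparams")


def find_missing_model_files(
    root: str, subdirs: Sequence[str] = ("det", "rec/en", "cls")
) -> Dict[str, List[str]]:
    """Return {subdir: [missing file names]} for every incomplete model folder."""
    missing = {}
    for subdir in subdirs:
        folder = Path(root) / subdir
        absent = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
        if absent:
            missing[subdir] = absent
    return missing


# An empty dictionary means every folder contains the expected files
print(find_missing_model_files("model"))
```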

### 2.2 Upload the model to the default Amazon SageMaker S3 bucket

In [None]:
!tar -zcvf model.tar.gz model
model_uri = sagemaker.Session().upload_data("model.tar.gz", key_prefix="ocr_model")
model_uri
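
SageMaker extracts `model.tar.gz` into the container's model directory, so the layout inside the archive determines the paths the inference script sees. The sketch below is a Python equivalent of the `tar` command above that also returns the member names for inspection. The helper name `build_model_archive` is hypothetical, and which layout `code/inference.py` expects is not shown in this notebook.

```python
import tarfile
from pathlib import Path
from typing import List


def build_model_archive(model_dir: str, archive_path: str = "model.tar.gz") -> List[str]:
    """Pack model_dir into a gzipped tarball and return the archived member names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps the top-level folder, like `tar -zcvf model.tar.gz model`
        archive.add(model_dir, arcname=Path(model_dir).name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Example usage from the notebook's working directory:
# members = build_model_archive("model")
# print(members)  # entries such as "model/det/..." show the top-level folder is kept
```

Because the top-level `model/` folder is kept, the extracted files end up one level below the container's model directory.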

### 2.3 Configure and deploy the SageMaker model

In [None]:
model = PyTorchModel(
    entry_point='inference.py',
    source_dir='code',
    model_data=model_uri,
    framework_version='1.11.0',
    py_version='py38',
    role=role,
)
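
The `code/inference.py` entry point is not shown in this notebook. As a rough illustration only, the sketch below shows what such a script might look like using the handler functions the SageMaker PyTorch serving container looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). It assumes the PaddleOCR 2.x API, that `paddleocr` and `paddlepaddle` are installed in the container (for example through a `requirements.txt` in `code/`), and that the archive keeps the top-level `model/` folder. The `model_paths` helper is hypothetical.

```python
import json
import os


def model_paths(model_dir: str) -> dict:
    """Map PaddleOCR arguments to folders inside the extracted archive."""
    # Assumption: the archive keeps the top-level `model/` folder
    root = os.path.join(model_dir, "model")
    return {
        "det_model_dir": os.path.join(root, "det"),
        "rec_model_dir": os.path.join(root, "rec", "en"),
        "cls_model_dir": os.path.join(root, "cls"),
    }


def model_fn(model_dir):
    """Called once when the container starts; the return value is passed to predict_fn."""
    from paddleocr import PaddleOCR  # imported lazily; must be installed in the container

    return PaddleOCR(use_angle_cls=True, lang="en", use_gpu=False, **model_paths(model_dir))


def input_fn(request_body, request_content_type):
    """Pass raw image bytes through unchanged."""
    if not request_content_type.startswith("image/"):
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return bytes(request_body)


def predict_fn(image_bytes, model):
    """Run OCR on the encoded image bytes (assumes PaddleOCR 2.x decodes bytes itself)."""
    return model.ocr(image_bytes, cls=True)


def output_fn(prediction, accept):
    """Serialize the nested result list as JSON."""
    return json.dumps(prediction)
```

The actual script used with this notebook may differ, for example in how it decodes the image or shapes the response.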

In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type="ml.t2.medium")
predictor.serializer = DataSerializer(content_type="image/png")
predictor.deserializer = JSONDeserializer()

### 2.4 Send an image to the running endpoint

In [None]:
img_path = "test_images/1.png"

In [None]:
def get_photo_text(img_path):
    # Read the image as raw bytes; the context manager closes the file afterwards
    with open(img_path, 'rb') as f:
        data = f.read()
    result = predictor.predict(data)
    return result

In [None]:
result = get_photo_text(img_path)
print(result)
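
The serializer configured earlier always sends `image/png`. If images in other formats are sent, one option is to derive the content type from the file name, as sketched below with the standard library's `mimetypes` module. The helper name `guess_image_content_type` is hypothetical, and whether the endpoint accepts other content types depends on `code/inference.py`, which is not shown here.

```python
import mimetypes


def guess_image_content_type(img_path: str, default: str = "application/octet-stream") -> str:
    """Return an image/* content type based on the file extension, or the default."""
    content_type, _ = mimetypes.guess_type(img_path)
    if content_type is None or not content_type.startswith("image/"):
        return default
    return content_type


print(guess_image_content_type("test_images/1.png"))

# Example usage with the predictor created earlier:
# predictor.serializer = DataSerializer(content_type=guess_image_content_type(img_path))
```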

## 3. PaddleOCR - Train model locally
First, download the pretrained model.

In [None]:
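# The commands in this section use paths relative to the cloned PaddleOCR/ directory; uncomment if you are not already inside it
#%cd PaddleOCR/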
# Download the pre-trained model of en_PP-OCRv3
!wget -P ./pretrain_models/ https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_train.tar
# Decompress model parameters
%cd pretrain_models/
!tar -xf en_PP-OCRv3_rec_train.tar && rm -rf en_PP-OCRv3_rec_train.tar
%cd ..

Next, create your own training dataset. It should follow this structure:
    
    -train_data/
        -train/
            -train1.jpg
            -train2.jpg
            -...
        -test/
            -test1.jpg
            -test2.jpg
            -...
        train_list.txt
        val_list.txt
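
PaddleOCR's recognition training reads label files with one tab-separated `image path` and `text` pair per line, where the image path is relative to the dataset directory. A minimal sketch for writing such a file and checking that every listed image exists is shown below. The helper names, file names, and labels are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def write_label_file(label_path: str, labels: Dict[str, str]) -> None:
    """Write one `relative/image/path<TAB>text` line per image."""
    lines = [f"{image}\t{text}" for image, text in labels.items()]
    Path(label_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def find_missing_images(label_path: str, data_dir: str) -> List[str]:
    """Return the listed image paths that do not exist under data_dir."""
    missing = []
    for line in Path(label_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        image, _, _text = line.partition("\t")
        if not (Path(data_dir) / image).is_file():
            missing.append(image)
    return missing


# Example usage (hypothetical labels):
# write_label_file("train_data/train_list.txt", {"train/train1.jpg": "hello"})
# print(find_missing_images("train_data/train_list.txt", "train_data"))
```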

In [None]:
!cp -R ../training_data train_data

Check that a GPU is available, then run the training. If training fails with a dataset error, you might need to reduce the training batch size, which you can change in configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml.

In [None]:
import paddle
paddle.utils.run_check()

In [None]:
!python3 tools/train.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=pretrain_models/en_PP-OCRv3_rec_train/best_accuracy Global.epoch_num=300 Global.eval_batch_step=[0,25]
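
Instead of editing the YAML file, the batch size can also be overridden with the same `-o` mechanism the command above uses. The sketch below builds the command as a list, assuming the batch size lives under the `Train.loader.batch_size_per_card` key as in PaddleOCR's recognition configs. The helper name `build_train_command` and the value 64 are illustrative only.

```python
import shlex
from typing import List, Optional

CONFIG = "configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml"


def build_train_command(batch_size: Optional[int] = None, epochs: int = 300) -> List[str]:
    """Build the tools/train.py command with optional config overrides."""
    overrides = [
        "Global.pretrained_model=pretrain_models/en_PP-OCRv3_rec_train/best_accuracy",
        f"Global.epoch_num={epochs}",
        "Global.eval_batch_step=[0,25]",
    ]
    if batch_size is not None:
        # Assumption: this key controls the per-GPU training batch size
        overrides.append(f"Train.loader.batch_size_per_card={batch_size}")
    return ["python3", "tools/train.py", "-c", CONFIG, "-o", *overrides]


# Print the command so it can be reviewed before running it
print(shlex.join(build_train_command(batch_size=64)))

# To launch training from inside the PaddleOCR/ directory:
# import subprocess
# subprocess.run(build_train_command(batch_size=64), check=True)
```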

In [None]:
# Review results
!python3 tools/infer_rec.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=output/v3_en_mobile/best_accuracy  Global.infer_img=../test_images/1.png

In [None]:
# Export the model
!python3 tools/export_model.py -c configs/rec/PP-OCRv3/en_PP-OCRv3_rec.yml -o Global.pretrained_model=output/v3_en_mobile/best_accuracy Global.save_inference_dir=../model/rec/en

In [None]:
%cd ..

You can now upload your trained model to Amazon S3 as model data and deploy it to Amazon SageMaker by following steps 2.2-2.4.