# Custom Optical Character Recongnition (OCR) Model with Tensorflow

I am using the dataset from kaggle. The dataset name is "standard OCR dataset" by Abhishek Jaiswal, link to the dataset: [standard OCR dataset](https://www.kaggle.com/datasets/preatcher/standard-ocr-dataset).

Libraries Importing:
1. numpy
2. pandas
3. matplotlib
4. tensorflow
5. opencv

In [9]:
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import cv2 as cv

Global variables used throughout the notebook.

In [35]:
KAGGLE_DATASET_FOLDER = "/kaggle/input/standard-ocr-dataset"
WORKING_DATASET_FOLDER = "/kaggle/working/ocr-dataset"
TRAINING_DIR = "training_data"
TEST_DIR = "testing_data"

Folder structure-
```
- standard-ocr-dataset
    - data
          - testing_data
              - 0
                  - 28310.png
                  - 28346.png
                  - ...
              - 1
              - 2
              - ...
          - training_data
              - ...
    - data2
          - testing_data
              - ...
          - training_data
              - ...
```
Merge `data` & `data2` folders and move them to `DATASET_FOLDER`.

In [51]:
os.makedirs(os.path.join(WORKING_DATASET_FOLDER, TRAINING_DIR))
os.makedirs(os.path.join(WORKING_DATASET_FOLDER, TEST_DIR))

In [26]:
def get_all_files_from_folder(path_name):
    files = []
    
    if os.path.isfile(path_name):
        files.append(path_name)
    else:
        for dirpath, dirnames, filenames in os.walk(path_name):
            for filename in filenames:
                files.append(os.path.join(dirpath, filename))
            for dirname in dirnames:
                files.extend(get_all_files_from_folder(os.path.join(dirpath, dirname)))
                
    return files

In [30]:
folder_path = os.path.join(KAGGLE_DATASET_FOLDER, "data", TRAINING_DIR)
print("Folder path: ", folder_path)
all_train_files = get_all_files_from_folder(os.path.join(KAGGLE_DATASET_FOLDER, "data", TEST_DIR))
print(f"Total training files: {len(all_test_files)}")

Folder path:  /kaggle/input/standard-ocr-dataset/data/training_data
Total training files: 2016


In [29]:
folder_path = os.path.join(KAGGLE_DATASET_FOLDER, "data", TEST_DIR)
print("Folder path: ", folder_path)
all_test_files = get_all_files_from_folder(os.path.join(KAGGLE_DATASET_FOLDER, "data", TEST_DIR))
print(f"Total test files: {len(all_test_files)}")

Folder path:  /kaggle/input/standard-ocr-dataset/data/testing_data
Total test files: 2016


In [31]:
all_train_files[0]

'/kaggle/input/standard-ocr-dataset/data/testing_data/N/28477.png'

In [52]:
# Move all training data to working dataset folder
for file in all_train_files:
    parts_of_filepath = file.split("/")
    filename = parts_of_filepath[-1]
    character = parts_of_filepath[-2]
    dest_dir = os.path.join(WORKING_DATASET_FOLDER, TRAINING_DIR, character)
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
        
    shutil.copy2(file, dest_dir)

# MOve all testing data to working dataset folder
for file in all_test_files:
    parts_of_filepath = file.split("/")
    filename = parts_of_filepath[-1]
    character = parts_of_filepath[-2]
    dest_dir = os.path.join(WORKING_DATASET_FOLDER, TEST_DIR, character)
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
        
    shutil.copy2(file, dest_dir)