**Assessment Task: Designing a Data Pipeline for Braille AI Using Unstructured Documents**

This notebook implements a simplified ETL pipeline to convert scanned book PDFs into a Braille-parallel corpus for training AI models.  
The steps include:

- Data Collection
- OCR Text Extraction & Cleaning  
- Structuring Extracted Text  
- Braille Translation  
- Parallel Corpus Generation  
- Final Data Export  

# Installing Required Libraries

In [2]:
!apt install tesseract-ocr -y
!pip install pytesseract

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13


In [3]:
!pip install pdf2image Pillow

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


In [4]:
!apt install tesseract-ocr-hin -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  tesseract-ocr-hin
0 upgraded, 1 newly installed, 0 to remove and 34 not upgraded.
Need to get 913 kB of archives.
After this operation, 1,138 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-hin all 1:4.00~git30-7274cfa-1.1 [913 kB]
Fetched 913 kB in 2s (563 kB/s)
Selecting previously unselected package tesseract-ocr-hin.
(Reading database ... 126102 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-hin_1%3a4.00~git30-7274cfa-1.1_all.deb ...
Unpacking tesseract-ocr-hin (1:4.00~git30-7274cfa-1.1) ...
Setting up tesseract-ocr-hin (1:4.00~git30-7274cfa-1.1) ...


In [5]:
!apt install poppler-utils -y

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 34 not upgraded.
Need to get 186 kB of archives.
After this operation, 697 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.8 [186 kB]
Fetched 186 kB in 2s (107 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 126106 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.8_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.8) ...
Setting up poppler-utils (22.02.0-2ubuntu0.8) ...
Processing triggers for man-db (2.10.2-1) ...


In [6]:
!apt-get update
!apt-get install liblouis-bin -y

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [1,683 kB]
Get:7 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease [24.3 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main amd64 Packages [2,728 kB]
Get:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease [24.6 kB]

In [8]:
!git clone https://github.com/liblouis/liblouis.git

Cloning into 'liblouis'...
remote: Enumerating objects: 41314, done.
remote: Counting objects: 100% (441/441), done.
remote: Compressing objects: 100% (216/216), done.
remote: Total 41314 (delta 341), reused 228 (delta 224), pack-reused 40873 (from 2)
Receiving objects: 100% (41314/41314), 102.46 MiB | 24.10 MiB/s, done.
Resolving deltas: 100% (30206/30206), done.


In [9]:
!ls /content/liblouis/tables

afr-za-g1.ctb			       IPA.utb
afr-za-g2.ctb			       is-chardefs6.cti
akk-borger.utb			       is-chardefs8.cti
akk.utb				       is.ctb
ancient-languages-borger.utb	       is.tbl
ancient-languages-us.utb	       it-it-comp6.utb
ar-ar-comp8.utb			       it-it-comp8.utb
ar-ar-g1-core.uti		       it.tbl
ar-ar-g1.utb			       iu-ca-g1.ctb
ar-ar-g2.ctb			       ja-kantenji-ucs2.utb
ar-ar-math.uti			       ja-kantenji.utb
ar.tbl				       ka-in-g1.utb
as-in-g1.utb			       kannada.cti
as.tbl				       ka.utb
awa.tbl				       kha.tbl
aw-in-g1.utb			       kh-in-g1.utb
ba.utb				       kk.utb
be-in-g1.utb			       km-g1.utb
bel-comp.utb			       kmr.tbl
bel-detailed.utb		       kn.tbl
bel.utb				       ko-2006.cti
bengali.cti			       ko-2006-g1.ctb
bg.ctb				       ko-2006-g2.ctb
bg.dis				       ko-chars.cti
bg.tbl				       ko.cti
bg.utb				       ko-g1.ctb
bh.ctb				       ko-g1-rules.cti
bh.tbl				       ko-g2.ctb
bn.tbl				       ko-g2-rules.cti
bo.ctb				       kok.ctb
bo.tbl				      

In [1]:
!mkdir -p /content/data/raw_pdfs

In [10]:
import os
import json
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
import re

In [11]:
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"

# Data Extraction

**Taking the First 10 Pages of Each PDF**

In [12]:
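def ocr_pdf_to_text(pdf_path, max_pages=10):
    """OCR the first `max_pages` pages of a PDF and return the cleaned text."""
    print(f"\nProcessing: {pdf_path}")
    # Rasterize only the requested page range at 300 DPI.
    images = convert_from_path(pdf_path, dpi=300, first_page=1, last_page=max_pages)

    all_text = ""
    for i, img in enumerate(images):
        print(f"  OCR on page {i+1}...")
        text = pytesseract.image_to_string(img, lang='eng')

        # Collapse all whitespace (including line breaks) so each page becomes one line.
        text = re.sub(r'\s+', ' ', text)
        # Drop non-ASCII characters such as curly quotes and OCR noise.
        text = re.sub(r'[^\x00-\x7F]+', '', text)
        text = text.strip()

        # Pages are separated by a blank line.
        all_text += text + "\n\n"

    return all_text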
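
The cleaning step above replaces every whitespace run, newlines included, with a single space, so paragraph boundaries inside a page are lost and each page becomes one line. The following is a minimal sketch of an alternative that keeps them, assuming Tesseract separates paragraphs with blank lines (common for `image_to_string`, but not guaranteed). `clean_page_text` is a hypothetical helper, not part of the notebook.

```python
import re

def clean_page_text(text):
    """Clean OCR text while keeping paragraph breaks (hypothetical helper)."""
    # Drop non-ASCII characters, as the notebook's cleaning step does.
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Treat blank lines (possibly containing spaces or tabs) as paragraph breaks.
    paragraphs = re.split(r'\n\s*\n', text)
    # Collapse whitespace inside each paragraph and discard empty paragraphs.
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    # Single newlines separate paragraphs, so "\n\n" stays free as a page separator.
    return "\n".join(p for p in cleaned if p)
```

Pages cleaned this way can still be joined with `"\n\n"`, and a later step can split on `"\n"` to recover paragraphs within a page.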

**Extracting Data**

In [13]:
from IPython import get_ipython
from IPython.display import display

In [14]:
pdf_dir = "/content/data/raw_pdfs"
ocr_outputs = {}

for filename in os.listdir(pdf_dir):
    if filename.lower().endswith(".pdf"):
        pdf_path = os.path.join(pdf_dir, filename)
        text = ocr_pdf_to_text(pdf_path)
        ocr_outputs[filename] = text


Processing: /content/data/raw_pdfs/orwell1984.pdf
  OCR on page 1...
  OCR on page 2...
  OCR on page 3...
  OCR on page 4...
  OCR on page 5...
  OCR on page 6...
  OCR on page 7...
  OCR on page 8...
  OCR on page 9...
  OCR on page 10...

Processing: /content/data/raw_pdfs/Animal Farm by George Orwell.pdf
  OCR on page 1...
  OCR on page 2...
  OCR on page 3...
  OCR on page 4...
  OCR on page 5...
  OCR on page 6...
  OCR on page 7...
  OCR on page 8...
  OCR on page 9...
  OCR on page 10...

Processing: /content/data/raw_pdfs/The Martian.pdf
  OCR on page 1...
  OCR on page 2...
  OCR on page 3...
  OCR on page 4...
  OCR on page 5...
  OCR on page 6...
  OCR on page 7...
  OCR on page 8...
  OCR on page 9...
  OCR on page 10...


In [15]:
output_dir = "/content/data/ocr_texts"
os.makedirs(output_dir, exist_ok=True)

for fname, content in ocr_outputs.items():
    # splitext handles any extension casing (e.g. ".PDF"), unlike a literal replace.
    out_path = os.path.join(output_dir, os.path.splitext(fname)[0] + ".txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(content)

print("\nOCR completed. Cleaned text files saved to:", output_dir)


OCR completed. Cleaned text files saved to: /content/data/ocr_texts


# Structuring Data

In [16]:
input_txt_dir = "/content/data/ocr_texts"
output_json_path = "/content/data/structured/structured_text.json"
os.makedirs(os.path.dirname(output_json_path), exist_ok=True)

In [17]:
structured_data = []

**JSON Formatting**

In [18]:
for txt_file in os.listdir(input_txt_dir):
    if txt_file.endswith(".txt"):
        base_name = txt_file.replace(".txt", "")
        txt_path = os.path.join(input_txt_dir, txt_file)

        with open(txt_path, "r", encoding="utf-8") as f:
            raw_text = f.read()

        paragraphs = [p.strip() for p in raw_text.split("\n\n") if len(p.strip()) > 30]

        for idx, para in enumerate(paragraphs):
            entry = {
                "id": f"{base_name}_pg{idx+1}",
                "source": f"{base_name}.pdf",
                "language": "en",
                "content": para
            }
            structured_data.append(entry)
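
Because the OCR step flattens each page to one line and joins pages with blank lines, the chunks produced by `split("\n\n")` correspond to pages rather than paragraphs. The loop above numbers them after the length filter, so `_pg3` is not necessarily page 3 when an earlier page was too short to keep. The sketch below numbers the chunks before filtering, assuming `pg` is meant to be the page number. `build_entries` and `min_chars` are hypothetical names.

```python
def build_entries(base_name, raw_text, min_chars=30):
    """Build corpus entries whose ids keep the original page position (hypothetical helper)."""
    entries = []
    # Number the chunks first, then filter, so skipped pages leave gaps in the ids.
    for page_no, chunk in enumerate(raw_text.split("\n\n"), start=1):
        content = chunk.strip()
        if len(content) <= min_chars:
            continue  # Same threshold as the notebook: keep chunks longer than 30 characters.
        entries.append({
            "id": f"{base_name}_pg{page_no}",
            "source": f"{base_name}.pdf",
            "language": "en",
            "content": content,
        })
    return entries
```

With this variant, a gap in the id sequence shows which pages were filtered out.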

In [19]:
with open(output_json_path, "w", encoding="utf-8") as f:
    json.dump(structured_data, f, indent=2)

print(f"Structured {len(structured_data)} entries into: {output_json_path}")

Structured 26 entries into: /content/data/structured/structured_text.json


# Braille Corpus

**Braille Translation**

In [20]:
ASCII_TO_UNICODE_BRAILLE = {
    'a': '⠁', 'b': '⠃', 'c': '⠉', 'd': '⠙', 'e': '⠑', 'f': '⠋', 'g': '⠛', 'h': '⠓',
    'i': '⠊', 'j': '⠚', 'k': '⠅', 'l': '⠇', 'm': '⠍', 'n': '⠝', 'o': '⠕', 'p': '⠏',
    'q': '⠟', 'r': '⠗', 's': '⠎', 't': '⠞', 'u': '⠥', 'v': '⠧', 'w': '⠺', 'x': '⠭',
    'y': '⠽', 'z': '⠵', ' ': ' ', ',': '⠂', ';': '⠆', ':': '⠒', '.': '⠲', '!': '⠖',
    '?': '⠦', '\'': '⠄', '-': '⠤', '/': '⠌', '@': '⠈', '*': '⠔', '=': '⠶', '(': '⠷',
    ')': '⠾', '+': '⠖', '$': '⠸⠎', '&': '⠯'
}

In [21]:
def ascii_to_unicode_braille(ascii_braille):
    # Characters missing from the table (digits, '#', uppercase letters, ...) are dropped silently.
    return ''.join(ASCII_TO_UNICODE_BRAILLE.get(ch, '') for ch in ascii_braille)
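
The sample output at the end of this notebook suggests that `lou_translate` emits North American Braille ASCII rather than print text: capitals appear to arrive as `,` (dot 6), numbers as `#` followed by letters, and the period as `4`. If that reading is right, the dictionary above maps `,` to the wrong cell and drops `#` and the period. The sketch below builds a complete lookup from the standard 64-character Braille ASCII ordering, which lines up with U+2800 through U+283F. `braille_ascii_to_unicode` is a hypothetical name, and the sketch has not been run against the notebook's data.

```python
# The 64 Braille ASCII characters in Unicode braille-pattern order (U+2800..U+283F).
BRAILLE_ASCII = " A1B'K2L@CIF/MSP\"E3H9O6R^DJG>NTQ,*5<-U8V.%[$+X!&;:4\\0Z7(_?W]#Y)="

BRAILLE_ASCII_TO_UNICODE = {ch: chr(0x2800 + i) for i, ch in enumerate(BRAILLE_ASCII)}
# Keep a plain space between words, matching the notebook's output convention.
BRAILLE_ASCII_TO_UNICODE[" "] = " "

def braille_ascii_to_unicode(ascii_braille):
    """Map Braille ASCII to Unicode braille patterns (hypothetical helper)."""
    # Braille ASCII is case-insensitive, so letters are upper-cased before lookup.
    # Unmapped characters pass through unchanged so that losses stay visible.
    return "".join(BRAILLE_ASCII_TO_UNICODE.get(ch.upper(), ch) for ch in ascii_braille)
```

Passing unmapped characters through, instead of replacing them with an empty string, makes untranslated residue easy to spot in the corpus.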

In [22]:
import subprocess

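def convert_to_braille(text, lang="en"):
    """Translate text with the lou_translate CLI and return Unicode Braille."""
    if lang == "en":
        # Grade 1 (uncontracted) US English table from the cloned liblouis repo.
        table_path = "/content/liblouis/tables/en-us-g1.ctb"
    else:
        raise ValueError(f"Unsupported language: {lang}")

    # lou_translate reads text on stdin and writes the translation to stdout.
    result = subprocess.run(
        ["lou_translate", table_path],
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    # Fail loudly instead of returning an empty translation on table or CLI errors.
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))

    ascii_braille = result.stdout.decode("utf-8").strip()
    return ascii_to_unicode_braille(ascii_braille)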
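
liblouis also ships a display table, `unicode.dis`, which is documented as making the translator emit Unicode braille patterns directly when it is listed before the translation table. That would remove the need for a Python-side character lookup. The sketch below assumes that `unicode.dis` sits in the same cloned `tables` directory and that the installed `lou_translate` accepts a comma-separated table list. `build_lou_command` and `translate_unicode` are hypothetical names, and the sketch has not been run against the notebook's data.

```python
import subprocess

TABLE_DIR = "/content/liblouis/tables"  # Path used in the notebook.

def build_lou_command(table="en-us-g1.ctb", table_dir=TABLE_DIR):
    """Return the lou_translate argv with unicode.dis listed first (hypothetical helper)."""
    # The display table comes first so that it controls the output characters (assumption).
    tables = [f"{table_dir}/unicode.dis", f"{table_dir}/{table}"]
    return ["lou_translate", ",".join(tables)]

def translate_unicode(text, table="en-us-g1.ctb"):
    """Run lou_translate and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        build_lou_command(table),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8").strip()
```

Comparing `translate_unicode("hello world")` with the mapped output of `convert_to_braille("hello world")` would be a quick way to check whether the two routes agree.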

**Example braille conversion**

In [23]:
print(convert_to_braille("hello world"))

⠓⠑⠇⠇⠕ ⠺⠕⠗⠇⠙


**Structured Braille Corpus**

In [24]:
with open("/content/data/structured/structured_text.json", "r", encoding="utf-8") as f:
    structured_data = json.load(f)

parallel_corpus = []

for entry in structured_data:
    para = entry["content"]
    book = entry["source"].replace(".pdf", "")
    para_id = entry["id"]

    braille = convert_to_braille(para)

    parallel_corpus.append({
        "braille": braille,
        "content": para,
        "source": book,
        "id": para_id
    })

print(f"Generated {len(parallel_corpus)} Braille-aligned entries.")

Generated 26 Braille-aligned entries.
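
Because the character lookup falls back to an empty string for unmapped characters, an entry can have both fields populated and still have lost content on the Braille side. The following is a sketch of a quick sanity check, not a full alignment test. It flags entries whose Braille field is empty, contains characters outside the Unicode braille block, or, as a rough heuristic for uncontracted output, has a different word count than the source. `find_suspect_entries` is a hypothetical helper.

```python
def find_suspect_entries(corpus):
    """Return (id, reason) pairs for entries that look misaligned (hypothetical helper)."""
    suspects = []
    for entry in corpus:
        braille = entry["braille"]
        if not braille.strip():
            suspects.append((entry["id"], "empty braille"))
            continue
        # Anything other than whitespace and U+2800..U+28FF is untranslated residue.
        if any(not (ch.isspace() or "\u2800" <= ch <= "\u28ff") for ch in braille):
            suspects.append((entry["id"], "non-braille characters"))
        # Heuristic for grade 1 output: word counts should normally match.
        elif len(braille.split()) != len(entry["content"].split()):
            suspects.append((entry["id"], "word count mismatch"))
    return suspects
```

Running `find_suspect_entries(parallel_corpus)` before saving gives a short list of entries worth inspecting by hand.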


In [25]:
os.makedirs("/content/data/braille", exist_ok=True)

with open("/content/data/braille/parallel_corpus.json", "w", encoding="utf-8") as f:
    json.dump(parallel_corpus, f, ensure_ascii=False, indent=2)

print("Saved to /content/data/braille/parallel_corpus.json")

Saved to /content/data/braille/parallel_corpus.json
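
For model training, a line-delimited file is often easier to stream than one large JSON array. The sketch below writes the same entries as JSON Lines, one object per line. `write_jsonl`, `read_jsonl`, and any output path passed to them are hypothetical, and `ensure_ascii=False` is kept so the braille cells stay readable, as in the cell above.

```python
import json

def write_jsonl(entries, path):
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps the braille cells readable in the file.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count

def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl(parallel_corpus, "/content/data/braille/parallel_corpus.jsonl")` would place the line-delimited copy next to the JSON file.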


**Braille Data**

In [26]:
import json

with open("/content/data/braille/parallel_corpus.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(f"Total entries: {len(data)}")
print(data[0])

Total entries: 26
{'braille': '⠂⠋⠊⠗⠎⠞ ⠏⠥⠃⠇⠊⠎⠓⠑⠙ ⠊⠝ ⠁⠊⠙⠙ ⠂⠞⠓⠊⠎ ⠺⠑⠃ ⠑⠙⠊⠞⠊⠕⠝ ⠏⠥⠃⠇⠊⠎⠓⠑⠙ ⠃⠽ ⠑⠂⠃⠕⠕⠅⠎⠁⠂⠁⠙⠑⠇⠁⠊⠙⠑ ⠂⠇⠁⠎⠞ ⠥⠏⠙⠁⠞⠑⠙ ⠂⠺⠑⠙⠝⠑⠎⠙⠁⠽ ⠂⠙⠑⠉⠑⠍⠃⠑⠗ ⠁⠛ ⠃⠚⠁⠙ ⠁⠞ ⠁⠙⠃⠚ ⠂⠞⠕ ⠞⠓⠑ ⠃⠑⠎⠞ ⠕⠋ ⠕⠥⠗ ⠅⠝⠕⠺⠇⠑⠙⠛⠑ ⠞⠓⠑ ⠞⠑⠭⠞ ⠕⠋ ⠞⠓⠊⠎ ⠺⠕⠗⠅ ⠊⠎ ⠊⠝ ⠞⠓⠑ ⠂⠏⠥⠃⠇⠊⠉ ⠂⠙⠕⠍⠁⠊⠝ ⠊⠝ ⠂⠁⠥⠎⠞⠗⠁⠇⠊⠁ ⠂⠂⠓⠕⠺⠑⠧⠑⠗ ⠉⠕⠏⠽⠗⠊⠛⠓⠞ ⠇⠁⠺ ⠧⠁⠗⠊⠑⠎ ⠊⠝ ⠕⠞⠓⠑⠗ ⠉⠕⠥⠝⠞⠗⠊⠑⠎ ⠁⠝⠙ ⠞⠓⠑ ⠺⠕⠗⠅ ⠍⠁⠽ ⠎⠞⠊⠇⠇ ⠃⠑ ⠥⠝⠙⠑⠗ ⠉⠕⠏⠽⠗⠊⠛⠓⠞ ⠊⠝ ⠞⠓⠑ ⠉⠕⠥⠝⠞⠗⠽ ⠋⠗⠕⠍ ⠺⠓⠊⠉⠓ ⠽⠕⠥ ⠁⠗⠑ ⠁⠉⠉⠑⠎⠎⠊⠝⠛ ⠞⠓⠊⠎ ⠺⠑⠃⠎⠊⠞⠑ ⠂⠊⠞ ⠊⠎ ⠽⠕⠥⠗ ⠗⠑⠎⠏⠕⠝⠎⠊⠃⠊⠇⠊⠞⠽ ⠞⠕ ⠉⠓⠑⠉⠅ ⠞⠓⠑ ⠁⠏⠏⠇⠊⠉⠁⠃⠇⠑ ⠉⠕⠏⠽⠗⠊⠛⠓⠞ ⠇⠁⠺⠎ ⠊⠝ ⠽⠕⠥⠗ ⠉⠕⠥⠝⠞⠗⠽ ⠃⠑⠋⠕⠗⠑ ⠙⠕⠺⠝⠇⠕⠁⠙⠊⠝⠛ ⠞⠓⠊⠎ ⠺⠕⠗⠅ ⠑⠂⠃⠕⠕⠅⠎⠁⠂⠁⠙⠑⠇⠁⠊⠙⠑ ⠂⠞⠓⠑ ⠂⠥⠝⠊⠧⠑⠗⠎⠊⠞⠽ ⠕⠋ ⠂⠁⠙⠑⠇⠁⠊⠙⠑ ⠂⠇⠊⠃⠗⠁⠗⠽ ⠂⠥⠝⠊⠧⠑⠗⠎⠊⠞⠽ ⠕⠋ ⠂⠁⠙⠑⠇⠁⠊⠙⠑ ⠂⠎⠕⠥⠞⠓ ⠂⠁⠥⠎⠞⠗⠁⠇⠊⠁ ⠑⠚⠚⠑ ⠖⠓⠞⠞⠏⠎⠒⠌⠌⠑⠃⠕⠕⠅⠎⠲⠁⠙⠑⠇⠁⠊⠙⠑⠲⠑⠙⠥⠲⠁⠥⠌⠕⠌⠕⠗⠺⠑⠇⠇⠌⠛⠑⠕⠗⠛⠑⠌⠕⠁⠌⠊⠝⠙⠑⠭⠲⠓⠞⠍⠇⠒ ⠂⠇⠁⠎⠞ ⠥⠏⠙⠁⠞⠑⠙ ⠂⠎⠥⠝⠙⠁⠽ ⠂⠍⠁⠗⠉⠓ ⠃⠛ ⠃⠚⠁⠋ ⠁⠞ ⠁⠁⠑⠓', 'content': 'First published in 1944. This web edition published by eBooks@Adelaide. Last updated Wednesday, December 17, 2014 at 14:20. To the best of our knowledge, the text of this work is in the Public Domain in Australia. HOWEVER, copyright law varies in other countries, a

# Downloading Final Data

In [27]:
import os
from google.colab import files
import shutil

def download_folder(folder_path):
    output_filename = os.path.basename(folder_path) + ".zip"
    shutil.make_archive(output_filename[:-4], 'zip', folder_path)
    files.download(output_filename)

folder_to_download = "/content/data"
download_folder(folder_to_download)
