<a href="https://colab.research.google.com/github/SHREYASINGHMAURYA/Text-Recognition-and-NLP-Processing-in-Native-Language/blob/main/NLPTA2SHREYA_SINGH_MAURYA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install opencv-python pytesseract nltk indic-nlp-library transformers


Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Collecting indic-nlp-library
  Downloading indic_nlp_library-0.92-py3-none-any.whl.metadata (5.7 kB)
Collecting sphinx-argparse (from indic-nlp-library)
  Downloading sphinx_argparse-0.5.2-py3-none-any.whl.metadata (3.7 kB)
Collecting sphinx-rtd-theme (from indic-nlp-library)
  Downloading sphinx_rtd_theme-3.0.2-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting morfessor (from indic-nlp-library)
  Downloading Morfessor-2.0.6-py3-none-any.whl.metadata (628 bytes)
Collecting sphinxcontrib-jquery<5,>=4 (from sphinx-rtd-theme->indic-nlp-library)
  Downloading sphinxcontrib_jquery-4.1-py2.py3-none-any.whl.metadata (2.6 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Downloading indic_nlp_library-0.92-py3-none-any.whl (40 kB)
Downloading Morfessor-2.0.6-py3-none-any.whl

In [None]:
!sudo apt install -y tesseract-ocr-hin

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  tesseract-ocr-hin
0 upgraded, 1 newly installed, 0 to remove and 30 not upgraded.
Need to get 913 kB of archives.
After this operation, 1,138 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-hin all 1:4.00~git30-7274cfa-1.1 [913 kB]
Fetched 913 kB in 2s (551 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tesseract-ocr-hin.
(Reading databa

In [None]:
import os
os.environ['TESSDATA_PREFIX'] = '/usr/share/tesseract-ocr/4.00/tessdata'
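
The hard-coded `TESSDATA_PREFIX` above matches the Tesseract 4 layout seen in this Colab session, but other installs may keep language data elsewhere. The following is a minimal sketch, not part of the original notebook: the helper name `find_tessdata_dir` is hypothetical, and the candidate directories are assumptions about common Linux layouts rather than an exhaustive list. It looks for `hin.traineddata` before setting the variable.

```python
import os

# Candidate locations are assumptions about common Linux layouts,
# not a complete list; adjust them for your own system.
CANDIDATE_TESSDATA_DIRS = [
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tessdata",
]

def find_tessdata_dir(lang="hin", candidates=CANDIDATE_TESSDATA_DIRS):
    """Return the first directory containing <lang>.traineddata, or None."""
    for directory in candidates:
        if os.path.isfile(os.path.join(directory, f"{lang}.traineddata")):
            return directory
    return None

tessdata_dir = find_tessdata_dir("hin")
if tessdata_dir is not None:
    os.environ["TESSDATA_PREFIX"] = tessdata_dir
else:
    print("hin.traineddata not found; install tesseract-ocr-hin first.")
```

If no candidate matches, the environment variable is left untouched, so an explicit path can still be set by hand as in the cell above.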

In [None]:
import cv2
import pytesseract
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize
from transformers import MarianMTModel, MarianTokenizer

# Optional: specify path to tesseract if not in PATH
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Step 1: Preprocess Image
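def preprocess_image(image_path):
    image = cv2.imread(image_path)
    # cv2.imread returns None (it does not raise) when the file is missing or unreadable
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # h sets the filter strength; h=30 smooths aggressively and may blur thin strokes
    denoised = cv2.fastNlMeansDenoising(gray, h=30)
    return denoised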

# Step 2: OCR for Hindi
def extract_text_from_image(image):
    return pytesseract.image_to_string(image, lang='hin')

# Step 3: Normalize Hindi text
def normalize_text(text):
    factory = IndicNormalizerFactory()
    normalizer = factory.get_normalizer("hi")
    return normalizer.normalize(text)

# Step 4: Tokenization
def tokenize_text(text):
    return indic_tokenize.trivial_tokenize(text, lang='hi')

# Step 5: Translate Hindi to English
def translate_to_english(text):
    model_name = 'Helsinki-NLP/opus-mt-hi-en'
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
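    # truncation=True silently drops any text beyond the model's maximum input length
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    translated = model.generate(**inputs)
    # The whole text is passed as one sequence, so only one output is decoded
    return tokenizer.decode(translated[0], skip_special_tokens=True)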

# Step 6: Display pipeline
def main(image_path):
    print("\n--- Hindi OCR and NLP Pipeline ---")
    image = preprocess_image(image_path)

    print("[1] Extracting Text...")
    raw_text = extract_text_from_image(image)
    print("Raw Text:\n", raw_text)

    print("\n[2] Normalizing...")
    norm_text = normalize_text(raw_text)
    print("Normalized:\n", norm_text)

    print("\n[3] Tokenizing...")
    tokens = tokenize_text(norm_text)
    print("Tokens:\n", tokens)

    print("\n[4] Translating to English...")
    translated = translate_to_english(norm_text)
    print("Translation:\n", translated)

# Change this to the path of your handwritten Hindi image
main("/content/images.jpg")



--- Hindi OCR and NLP Pipeline ---
[1] Extracting Text...
Raw Text:
 आजादी

कही है केर अगर,
जे उड़ने में कर मदद त।
सतत है काली आगर

 

  
 

जला कर रौशन कर तू
(विकार में उल्क- कर

 

बी नए कर सा रूटिवादी
सुलझा पर के आात तू.
रत, आदमी वा हो कोई बच्चा

   


[2] Normalizing...
Normalized:
 आजादी

कही है केर अगर,
जे उड़ने में कर मदद त।
सतत है काली आगर

 

  
 

जला कर रौशन कर तू
(विकार में उल्क- कर

 

बी नए कर सा रूटिवादी
सुलझा पर के आात तू.
रत, आदमी वा हो कोई बच्चा

   


[3] Tokenizing...
Tokens:
 ['आजादी\n\nकही', 'है', 'केर', 'अगर', ',', '\nजे', 'उड़ने', 'में', 'कर', 'मदद', 'त', '।', '\nसतत', 'है', 'काली', 'आगर\n\n', '\n\n', '\n', '\n\nजला', 'कर', 'रौशन', 'कर', 'तू\n', '(', 'विकार', 'में', 'उल्क', '-', 'कर\n\n', '\n\nबी', 'नए', 'कर', 'सा', 'रूटिवादी\nसुलझा', 'पर', 'के', 'आात', 'तू', '.', '\nरत', ',', 'आदमी', 'वा', 'हो', 'कोई', 'बच्चा\n\n', '\n\x0c']

[4] Translating to English...


Translation:
 Charer said that if he could help in his fly, he's constantly burning black fireer and lit up you (option-B) over the new root solver, then you have a man, a son, or a child.
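
The token list above contains entries such as `'आजादी\n\nकही'` and `'\n\x0c'` because the raw OCR string still carries line breaks and a trailing form feed, and the whole poem reaches the translator as a single run-on string. The following is a minimal sketch, not part of the original notebook: the helper names `split_ocr_lines` and `translate_lines` are hypothetical, and `translate_fn` is any caller-supplied function (for instance the `translate_to_english` function defined earlier). It cleans the text line by line before further processing. Whether this improves the translation for a given image has not been tested here.

```python
import re

def split_ocr_lines(raw_text):
    """Split OCR output into non-empty lines with normalized whitespace.

    str.splitlines() treats the form feed (\\x0c) that Tesseract appends
    as a line boundary, so it is discarded along with the blank lines.
    """
    lines = []
    for line in raw_text.splitlines():
        cleaned = re.sub(r"\s+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return lines

def translate_lines(raw_text, translate_fn):
    """Apply translate_fn to each cleaned line; return (source, translation) pairs."""
    return [(line, translate_fn(line)) for line in split_ocr_lines(raw_text)]

# A short excerpt shaped like the raw OCR output shown earlier
sample = "आजादी\n\nकही है केर अगर,\n \n  \nजला कर रौशन कर तू\n\x0c"
print(split_ocr_lines(sample))
```

As written, `translate_to_english` reloads the tokenizer and model on every call, so using it as `translate_fn` would repeat that work for each line; loading both once outside the function avoids this.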
