<div style="text-align: center;">
<h1>OCR project with Python</h1>
</div>

<h2>Project Overview</h2>

Optical Character Recognition (OCR) is a technology that extracts and converts text from images, scanned documents, or handwritten content into machine-readable formats. It is widely used for digitizing printed materials, automating data entry, and enabling text-based search within visual content.

In this project, we use the **`easyocr`** library, an OCR solution built on deep learning. **`easyocr`** supports more than 80 languages, and its recognition model combines a convolutional feature extractor such as **ResNet** with an **LSTM** (Long Short-Term Memory) sequence model, which helps it read text in challenging images such as low-quality photos or non-standard fonts.

We also use **Tesseract**, an open-source OCR engine, as a point of comparison. Tesseract is one of the most widely used OCR engines and supports over 100 languages. It recognizes characters in images and converts them into editable text, and it is particularly effective on documents and scanned images. The `pytesseract` library wraps the Tesseract executable so that it can be called from Python; the engine itself must be installed separately.


<h2>LSTM</h2>

A Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) designed to process sequential data by capturing both short-term and long-term dependencies. Unlike traditional RNNs, LSTMs use memory cells and three gates (input, forget, and output) that regulate the flow of information. This lets them retain relevant information over long sequences while mitigating the vanishing-gradient problem, which makes them well suited to tasks such as speech recognition, text generation, and time-series analysis.

![LSTM Scheme](\lstm_scheme.jpg)


Reference: [Understanding LSTM and Its Diagrams](https://blog.mlreview.com/understanding-lstm-and-its-diagrams-37e2f46f1714)
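
To make the gate mechanism concrete, here is a minimal sketch of a single LSTM step on scalar values. It is illustrative only: the function name `lstm_step`, the layout of `params`, and the weights are assumptions made for this example, and real OCR models use trained, vector-valued versions of the same equations.

```python
import math

def sigmoid(x):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """Run one scalar LSTM step.

    params maps a gate name to (weight_on_x, weight_on_h_prev, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget_gate = sigmoid(pre_activation("forget"))    # how much old memory to keep
    input_gate = sigmoid(pre_activation("input"))      # how much new content to write
    output_gate = sigmoid(pre_activation("output"))    # how much memory to expose
    candidate = math.tanh(pre_activation("candidate")) # proposed new content

    c = forget_gate * c_prev + input_gate * candidate
    h = output_gate * math.tanh(c)
    return h, c

# Hypothetical weights: the biases force "keep the memory, ignore the input".
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.7, params=params)
print(h, c)
```

With the forget gate close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is the behaviour that lets an LSTM remember information across many steps.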

<h2>Installation Instructions:</h2>
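Before running this notebook, install the required Python packages: **EasyOCR**, **scikit-learn**, **pytesseract**, **Pillow**, and **OpenCV** (imported as `cv2` in the preprocessing sections). `pytesseract` is only a wrapper, so the Tesseract engine itself must also be installed on your system. You can install the Python packages by running the following commands in a code cell:

In [None]:
!pip install easyocr
!pip install scikit-learn
!pip install pytesseract
!pip install opencv-python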
!pip install Pillow
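
After installation, a quick check can confirm that the modules used in this notebook are importable and that a Tesseract executable is visible on the `PATH`. This is a small sketch using only the standard library; the helper name `missing_modules` is an assumption for illustration, and the module list simply mirrors the imports used in the cells of this notebook.

```python
import importlib.util
import shutil

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names used in this notebook (not the pip package names).
required = ["easyocr", "sklearn", "pytesseract", "PIL", "cv2"]

print("Missing Python modules:", missing_modules(required))
# shutil.which returns None when no executable called "tesseract" is on the PATH.
print("Tesseract executable:", shutil.which("tesseract"))
```

If the Tesseract line prints `None`, the engine is either not installed or installed outside the `PATH`, in which case its location has to be given to `pytesseract` explicitly.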

<h2 style="text-align: center;">EasyOCR</h2>
<h3> Naive Approach for OCR Text Extraction </h3>

In this section, we demonstrate a **naive approach** to Optical Character Recognition (OCR) using the **EasyOCR** library. The idea is to extract text from images using EasyOCR's built-in methods without any additional optimization or preprocessing. The following steps are involved:

1. **Image and Text Files Setup**: We define the paths to the images (`test_OCR_1.jpg` and `test_OCR_2.jpg`) and to the text files that hold their reference texts (`test_OCR_trad_1.txt` and `test_OCR_trad_2.txt`).
2. **Reading the Reference Texts**: We read the reference texts from these files so that we can compare them with the OCR results.
3. **OCR Initialization and Execution**: We initialize an EasyOCR reader for both French (`fr`) and English (`en`) and use it to extract text from the two images.
4. **Text Extraction and Formatting**: For each image, we join the detected text fragments into a single string, one fragment per line.
5. **Saving the Result**: Finally, we save each OCR result to its own file (`naive_test_ocr_1.txt` and `naive_test_ocr_2.txt`) for later evaluation.
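
`reader.readtext` returns a list of `(bounding_box, text, confidence)` tuples, which is why the code in this notebook unpacks three values per result. The sketch below reproduces the joining step on hand-written sample data so that it runs without EasyOCR. The helper name `join_ocr_results`, the optional `min_confidence` filter, and the sample values are assumptions for illustration; the notebook's own function keeps every fragment.

```python
def join_ocr_results(results, min_confidence=0.0):
    """Join recognized fragments into one string, one fragment per line.

    results: iterable of (bounding_box, text, confidence) tuples.
    min_confidence: fragments scoring below this value are dropped.
    """
    lines = [text for _box, text, confidence in results
             if confidence >= min_confidence]
    return "\n".join(lines)

# Made-up results in the same shape as EasyOCR's output.
sample_results = [
    ([[0, 0], [50, 0], [50, 10], [0, 10]], "Bonjour", 0.92),
    ([[0, 12], [40, 12], [40, 22], [0, 22]], "le monde", 0.35),
]

print(join_ocr_results(sample_results))
print(join_ocr_results(sample_results, min_confidence=0.5))
```

Filtering on the confidence score is one possible way to discard fragments the model itself is unsure about, at the risk of dropping correct but low-scoring text.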


In [46]:
import easyocr

def process_ocr_image_naive(image_path, output_path, languages=['fr', 'en']):
    """
    Processes OCR on a single image and saves the extracted text to an output file.
    
    Args:
    - image_path (str): File path to the input image for OCR processing.
    - output_path (str): File path where the extracted text will be saved.
    - languages (list of str, optional): List of languages for the OCR reader (default is ['fr', 'en']).
    
    Returns:
    - None
    
    Notes:
    - This function reads text from the image using EasyOCR and writes the extracted text
      to the specified output file in 'utf-8' encoding.
    """
    reader = easyocr.Reader(languages)
    
    results = reader.readtext(image_path)
    
    extracted_text = "\n".join([text for _, text, _ in results])
    
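    # Append mode ("a") keeps any existing file content, so re-running this
    # function on the same output path adds the text a second time.
    with open(output_path, "a", encoding="utf-8") as file:
        file.write(extracted_text)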


In [47]:
test_OCR_1 = "../OCR_Items/test_OCR_1.jpg"
test_OCR_trad_1 = "../OCR_Items/test_OCR_trad_1.txt"
test_OCR_2 = "../OCR_Items/test_OCR_2.jpg"
test_OCR_trad_2 = "../OCR_Items/test_OCR_trad_2.txt"
path_naive_approach_test_ocr_1 = "./EasyOCR_Result/naive_test_ocr_1.txt"
path_naive_approach_test_ocr_2 = "./EasyOCR_Result/naive_test_ocr_2.txt"

with open(test_OCR_trad_1, "r",encoding="utf-8") as fichier:
	traduction_1 = fichier.read()

with open(test_OCR_trad_2, "r",encoding="utf-8") as fichier:
	traduction_2 = fichier.read()


process_ocr_image_naive(test_OCR_1, path_naive_approach_test_ocr_1)

process_ocr_image_naive(test_OCR_2, path_naive_approach_test_ocr_2)


Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


<h3> Comparing OCR Text with Reference Text </h3>

In this section, we will create a function that compares the output from the Optical Character Recognition (OCR) with a reference text. The goal is to evaluate how similar the extracted text is to the original text. 

We will use a **Cosine Similarity** approach to measure the percentage of similarity between the two texts. Cosine Similarity calculates the cosine of the angle between two vectors, representing the two texts. A higher cosine similarity value indicates that the two texts are more alike.

To transform the texts into numerical vectors, we will use the **TF-IDF** (Term Frequency-Inverse Document Frequency) method. TF-IDF is a statistical measure used to evaluate how important a word is in a document relative to a collection of documents (or corpus). It helps in reducing the influence of commonly occurring words (like "the", "is", etc.) while emphasizing the more meaningful words in the text.

- **Term Frequency (TF)**: Measures how frequently a term appears in a document.
- **Inverse Document Frequency (IDF)**: Measures how important a term is by calculating the inverse of how often it appears across all documents.

For more detailed information about TF-IDF, you can refer to this [Wikipedia page](https://fr.wikipedia.org/wiki/TF-IDF) (in French).

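By the end of this comparison, we will have a similarity percentage that indicates how closely the vocabulary of the OCR text matches that of the reference text. Because TF-IDF treats each text as a bag of words, the score ignores word order, and a word with even one misread character counts as a complete mismatch.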
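
To see how the score behaves, here is a simplified pure-Python sketch of TF-IDF followed by cosine similarity. It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, but the tokenization is deliberately naive, so the exact numbers may differ from `TfidfVectorizer`. The function name `tfidf_cosine` and the sample sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tfidf_cosine(text_a, text_b):
    """Cosine similarity (0.0 to 1.0) between the TF-IDF vectors of two texts."""
    docs = [re.findall(r"\w+", text.lower()) for text in (text_a, text_b)]
    vocabulary = set(docs[0]) | set(docs[1])
    n_docs = len(docs)

    # Document frequency: in how many of the two texts each term appears.
    df = {term: sum(term in doc for doc in docs) for term in vocabulary}
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

    # One sparse vector per text: term -> raw count * idf.
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({term: count * idf[term] for term, count in counts.items()})

    dot = sum(weight * vectors[1].get(term, 0.0) for term, weight in vectors[0].items())
    norm_a = math.sqrt(sum(w * w for w in vectors[0].values()))
    norm_b = math.sqrt(sum(w * w for w in vectors[1].values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with the other one
    return dot / (norm_a * norm_b)

print(tfidf_cosine("le chat dort", "dort le chat"))   # same words, different order
print(tfidf_cosine("le chat dort", "le chien dort"))  # one word differs
print(tfidf_cosine("chat", "chot"))                   # one character misread
```

The last call illustrates why noisy OCR output can score close to zero: a word-level measure gives no partial credit for a word that is almost right.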


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compare_texts(refText, textToCompare):
    """
    Compares two texts and calculates their similarity percentage using 
    the Cosine Similarity and TF-IDF (Term Frequency-Inverse Document Frequency) method.
    
    Args:
    - refText (str): Reference text.
    - textToCompare (str): Text to compare with the reference text.
    
    Returns:
    - float: The percentage similarity between the two texts (0 to 100).
    
    Notes:
    - This version handles multi-line texts by converting them into a single uniform line.
    - This version converts both texts to lowercase for case-insensitive comparison.
    - This version does not use any stop words for comparison.
    """
    
    refText = " ".join(refText.splitlines()).lower()
    textToCompare = " ".join(textToCompare.splitlines()).lower()

    tfidf_vectorizer = TfidfVectorizer()

    tfidf_matrix = tfidf_vectorizer.fit_transform([refText, textToCompare])

    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])

    similarity_percentage = round(cosine_sim[0][0] * 100, 2)

    return similarity_percentage


In [49]:
with open(path_naive_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	naive_approach_test_ocr_1 = fichier.read()

with open(path_naive_approach_test_ocr_2, "r", encoding="utf-8") as fichier:
	naive_approach_test_ocr_2 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,naive_approach_test_ocr_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,naive_approach_test_ocr_2)} / 100")

The similarity rate for text 1 is : 1.67 / 100
The similarity rate for text 2 is : 0.0 / 100


<h3> Processing Noisy Images with OCR </h3>

<h4>Objective</h4>
This section focuses on handling noisy images to extract text using Optical Character Recognition (OCR). Noise in images, such as graininess or blurriness, can hinder OCR accuracy. To address this, we preprocess the images by reducing noise before performing OCR.

<h4> Steps Involved </h4>
1. <b>Reading the Image</b>:
   - The noisy image is loaded in grayscale using OpenCV. Grayscale simplifies the image processing pipeline by working with a single color channel.

2. <b>Reducing Noise</b>:
   - A <b>Gaussian Blur</b> filter is applied to the image. This filter smooths the image by attenuating high-frequency noise, which can improve OCR on grainy images, although it also softens character edges.

3. <b>Text Extraction with EasyOCR</b>:
   - We use EasyOCR, a multilingual OCR library, to detect and extract text from the preprocessed image. For each detected region, EasyOCR returns the bounding box, the recognized text, and a confidence score.

4. <b>Saving the Extracted Text</b>:
   - The extracted text is written to a file for further analysis or use. This ensures that the results are reproducible and accessible for subsequent steps in the pipeline.

<h4> Why Use Gaussian Blur? </h4>
Gaussian blur is a widely used preprocessing technique in computer vision. It helps:<br>
- Reduce image noise and detail, making text regions stand out more clearly.<br>
- Improve the accuracy of OCR by eliminating small, irrelevant artifacts in the image.<br>
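
The smoothing effect can be illustrated in one dimension with plain Python. This is a sketch rather than OpenCV's implementation: the helper names `gaussian_kernel_1d` and `blur_1d` are hypothetical, the value of `sigma` is chosen by hand, and borders are handled by repeating the edge value.

```python
import math

def gaussian_kernel_1d(size, sigma):
    """Return `size` Gaussian weights centred on the middle, normalized to sum to 1."""
    half = size // 2
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(signal, kernel):
    """Replace each sample by the kernel-weighted average of its neighbours."""
    half = len(kernel) // 2
    blurred = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # Clamp the index so samples past the ends repeat the edge value.
            j = min(max(i + k - half, 0), len(signal) - 1)
            acc += weight * signal[j]
        blurred.append(acc)
    return blurred

kernel = gaussian_kernel_1d(size=5, sigma=1.0)
row = [0, 0, 0, 255, 0, 0, 0]   # a single bright "noise" pixel on a dark row
print([round(v, 1) for v in blur_1d(row, kernel)])
```

The isolated spike is spread over its neighbours and its peak drops, which is how the blur suppresses speckle noise. The same averaging also blurs thin pen strokes, so a larger kernel is not automatically better for OCR.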

<h4> Example Use Case </h4>
Imagine you have a noisy scanned document or a photograph of a sign with visual artifacts. By applying this preprocessing approach:<br>
- You reduce the noise using Gaussian blur.<br>
- Extract the text using OCR with higher accuracy.<br>
- Save the transcription for further use, such as comparison with reference texts.<br>

<h4> Code Explanation </h4>
The function `process_noisy_image` takes:
- The path to the input image.<br>
- The path to save the extracted text.<br>
- Optional parameters for specifying OCR languages and Gaussian blur settings.<br>

The function loads the image in grayscale, applies the blur, runs OCR on the result, and writes the recognized text to the output file.
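
One practical detail about the blur settings: OpenCV's `GaussianBlur` expects the kernel width and height to be positive and odd when, as in this notebook, sigma is passed as `0` and derived from the kernel size. The check below is a small optional sketch; the helper name `is_valid_blur_kernel` is hypothetical and not part of OpenCV.

```python
def is_valid_blur_kernel(blur_kernel):
    """Return True if blur_kernel is a (width, height) pair of positive odd integers."""
    if len(blur_kernel) != 2:
        return False
    return all(isinstance(size, int) and size > 0 and size % 2 == 1
               for size in blur_kernel)

print(is_valid_blur_kernel((5, 5)))
print(is_valid_blur_kernel((4, 4)))
```

Calling such a check before `cv2.GaussianBlur` turns an OpenCV assertion error into a condition that can be reported with a clearer message.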

<h4>References</h4>
<ul>
    <li>EasyOCR Documentation: <a href="https://www.jaided.ai/easyocr/" target="_blank">https://www.jaided.ai/easyocr/</a></li>
    <li>Gaussian Blur Explanation: <a href="https://en.wikipedia.org/wiki/Gaussian_blur" target="_blank">https://en.wikipedia.org/wiki/Gaussian_blur</a></li>
</ul>



In [50]:
import cv2
import easyocr

def process_noisy_image(input_image_path, output_text_path, languages=['en'], blur_kernel=(5, 5)):
    """
    Processes a noisy image by applying Gaussian blur and performs OCR to extract text,
    saving the result to a text file.

    Args:
    - input_image_path (str): Path to the input image file.
    - output_text_path (str): Path to the output text file where OCR results will be saved.
    - languages (list of str, optional): List of languages for OCR processing (default is ['en']).
    - blur_kernel (tuple, optional): Kernel size for Gaussian blur (default is (5, 5)).

    Returns:
    - None

    Notes:
    - The function reads the image, applies Gaussian blur to reduce noise, and performs OCR.
    - The extracted text is saved in UTF-8 encoding to the specified output file.
    """
    img = cv2.imread(input_image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Image not found at path: {input_image_path}")
    
    blurred_img = cv2.GaussianBlur(img, blur_kernel, 0)
    
    reader = easyocr.Reader(languages)
    
    results = reader.readtext(blurred_img)
    
    extracted_text = "\n".join([text for _, text, _ in results])
    
    with open(output_text_path, "w", encoding="utf-8") as file:
        file.write(extracted_text)


In [51]:
path_GB_approach_test_ocr_1 = "./EasyOCR_Result/GB_test_ocr_1.txt"
path_GB_approach_test_ocr_2 = "./EasyOCR_Result/GB_test_ocr_2.txt"

process_noisy_image(test_OCR_1, path_GB_approach_test_ocr_1)
process_noisy_image(test_OCR_2, path_GB_approach_test_ocr_2)

with open(path_GB_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	noisy_image_ocr_1 = fichier.read()

with open(path_GB_approach_test_ocr_2, "r",encoding="utf-8") as fichier:
	noisy_image_ocr_2 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,noisy_image_ocr_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,noisy_image_ocr_2)} / 100")


Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


The similarity rate for text 1 is : 4.6 / 100
The similarity rate for text 2 is : 0.0 / 100


<h3> Improvement </h3>

The function below applies a heavier preprocessing pipeline aimed at handwritten text. Its goal is to:<br>
1. Load an image containing handwritten text.<br>
2. Apply image preprocessing to make the text more readable for OCR (Optical Character Recognition).<br>
3. Use <b>EasyOCR</b> to extract the text.<br>
4. Save the extracted text to a file.<br>
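
The core of this preprocessing is adaptive thresholding: each pixel is compared with a threshold computed from its own neighbourhood rather than with one global value. The sketch below illustrates the idea in plain Python on tiny images. It is a simplification and an assumption-laden one: it uses a plain mean of the neighbourhood instead of the Gaussian-weighted mean used in the notebook, repeats border pixels at the edges, and the name `adaptive_mean_threshold` is hypothetical.

```python
def adaptive_mean_threshold(image, block_size=3, c=2):
    """Binarize a grayscale image given as a list of rows of 0..255 values.

    A pixel becomes 255 (white) if it is brighter than the mean of its
    block_size x block_size neighbourhood minus c, and 0 (black) otherwise.
    """
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-half, half + 1)
                for dx in range(-half, half + 1)
            ]
            threshold = sum(neighbours) / len(neighbours) - c
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result

# The same dark stroke (centre pixel) on a brightly lit and on a dimly lit page.
bright_page = [[200, 200, 200], [200, 120, 200], [200, 200, 200]]
dim_page = [[90, 90, 90], [90, 30, 90], [90, 90, 90]]

print(adaptive_mean_threshold(bright_page))
print(adaptive_mean_threshold(dim_page))
```

Both pages come out with a white background and a black stroke using the same settings, whereas a single global threshold of, say, 100 would turn the whole dim page black.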

In [52]:
import cv2
import easyocr

def preprocess_and_process_image(input_image_path, output_text_path):
    """
    Preprocesses an image of handwritten text and extracts the text using EasyOCR.

    Args:
    - input_image_path (str): Path to the input image.
    - output_text_path (str): Path to save the transcribed text.

    Returns:
    - None

    Steps:
    1. Load the image and preprocess it for handwritten text.
    2. Apply EasyOCR to extract text from the processed image.
    3. Save the extracted text to the specified file.
    """
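    # Load the image as a single-channel grayscale array
    img = cv2.imread(input_image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Image not found at path: {input_image_path}")

    # Binarize with a locally computed threshold (11x11 neighbourhood, offset 2)
    binary_img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                                       cv2.THRESH_BINARY, 11, 2)

    # Remove isolated speckles left by the thresholding
    denoised_img = cv2.medianBlur(binary_img, 3)

    # Morphological closing (dilation followed by erosion) with a 2x2 rectangular kernel
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    processed_img = cv2.morphologyEx(denoised_img, cv2.MORPH_CLOSE, kernel)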


    reader = easyocr.Reader(['en', 'fr'])
    results = reader.readtext(processed_img)

    extracted_text = ""
    for (_, text, _) in results:
        extracted_text += text + "\n"

    with open(output_text_path, "w", encoding="utf-8") as output_file:
        output_file.write(extracted_text)


In [53]:
path_PPI_approach_test_ocr_1 = "./EasyOCR_Result/PPI_test_ocr_1.txt"
path_PPI_approach_test_ocr_2 = "./EasyOCR_Result/PPI_test_ocr_2.txt"

preprocess_and_process_image(test_OCR_1, path_PPI_approach_test_ocr_1)
preprocess_and_process_image(test_OCR_2, path_PPI_approach_test_ocr_2)

with open(path_PPI_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	ppi_image_ocr_1 = fichier.read()

with open(path_PPI_approach_test_ocr_2, "r",encoding="utf-8") as fichier:
	ppi_image_ocr_2 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,ppi_image_ocr_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,ppi_image_ocr_2)} / 100")

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


The similarity rate for text 1 is : 0.0 / 100
The similarity rate for text 2 is : 0.0 / 100


<h3>Testing on a Non-Handwritten Image </h3>

In this section, we will perform a test using an image that does not contain handwritten text, but rather printed text. This will allow us to observe the differences in the OCR results when applied to non-handwritten text.

<h4> Objective </h4>
The goal of this test is to compare the performance of the EasyOCR model on:<br>
1. <b>Handwritten text images</b>: Which we have been processing in the previous steps.<br>
2. <b>Printed text images</b>: To observe how the preprocessing steps affect the recognition accuracy when the text is not handwritten but printed.<br>

<h4> Expected Results </h4>
- <b>Handwritten text</b>: Handwritten text often varies in style, spacing, and clarity, which can make recognition more difficult for OCR models, even with preprocessing techniques.<br>
- <b>Printed text</b>: Printed text, being more uniform and clear, is typically easier for OCR models to recognize, and we expect better results with less need for heavy preprocessing.<br>

By running the OCR on a printed text image, we will be able to directly compare the effectiveness of our preprocessing steps on different types of text.


In [54]:
path_trad = "../OCR_Items/test_image.txt"
test_OCR_3 = "../OCR_Items/test_image.jpeg"


with open(path_trad, "r",encoding="utf-8") as fichier:
	traduction_3 = fichier.read()


path_NHW_naive_approach_test_ocr = "./EasyOCR_Result/NHW_naive_test_ocr.txt"
path_NHW_GB_approach_test_ocr = "./EasyOCR_Result/NHW_GB_test_ocr.txt"
path_NHW_PPI_approach_test_ocr = "./EasyOCR_Result/NHW_PPI_test_ocr.txt"

process_ocr_image_naive(test_OCR_3, path_NHW_naive_approach_test_ocr)
process_noisy_image(test_OCR_3, path_NHW_GB_approach_test_ocr)
preprocess_and_process_image(test_OCR_3, path_NHW_PPI_approach_test_ocr)

with open(path_NHW_naive_approach_test_ocr, "r",encoding="utf-8") as fichier:
	NHW_Naive = fichier.read()

with open(path_NHW_GB_approach_test_ocr, "r",encoding="utf-8") as fichier:
	NHW_GB = fichier.read()

with open(path_NHW_PPI_approach_test_ocr, "r",encoding="utf-8") as fichier:
	NHW_PPI = fichier.read()

print(f"Naive approach : {compare_texts(traduction_3,NHW_Naive)} / 100")
print(f"Gaussian Blur : {compare_texts(traduction_3,NHW_GB)} / 100")
print(f"Preprocess : {compare_texts(traduction_3,NHW_PPI)} / 100")

Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU.


Naive approach : 81.02 / 100
Gaussian Blur : 74.08 / 100
Preprocess : 88.2 / 100


<h2>OCR with Tesseract</h2>

In the previous section, we used **EasyOCR**, a lightweight and efficient library for Optical Character Recognition (OCR). This time, we will perform the same task using **Tesseract**, one of the most popular and powerful open-source OCR engines.

The goal is to compare the accuracy of these two libraries on the same images, both handwritten and printed. Speed and flexibility also matter when choosing an OCR library, but they are not measured here.

We will follow the same process as before by extracting text from an image and saving it to an output file. However, this time, we will use **`pytesseract`**, a Python wrapper for **Tesseract**, to perform the OCR on the images.

By comparing the results from **EasyOCR** and **Tesseract**, we will gain a better understanding of the strengths and limitations of each library and make an informed decision for future OCR tasks.
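
One difference to keep in mind when switching libraries is the language codes: EasyOCR is called with `fr` and `en` in this notebook, while Tesseract is called with `fra` and `eng`, joined by `+` into a single `lang` string. The sketch below shows the conversion for the two languages used here; the mapping table and the helper name `to_tesseract_lang` are assumptions for illustration and cover only these two codes.

```python
# Only the two languages used in this notebook are mapped.
EASYOCR_TO_TESSERACT = {"fr": "fra", "en": "eng"}

def to_tesseract_lang(easyocr_codes):
    """Build a Tesseract `lang` string such as 'fra+eng' from EasyOCR codes."""
    return "+".join(EASYOCR_TO_TESSERACT[code] for code in easyocr_codes)

print(to_tesseract_lang(["fr", "en"]))
```

Using the same set of languages for both libraries keeps the comparison between them as fair as possible.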


In [60]:
import pytesseract
from PIL import Image

def process_ocr_image_naive_tesseract(image_path, output_path, languages=['fra', 'eng']):
    """
    Processes OCR on a single image and saves the extracted text to an output file.
    
    Args:
    - image_path (str): File path to the input image for OCR processing.
    - output_path (str): File path where the extracted text will be saved.
    - languages (list of str, optional): List of languages for the OCR reader (default is ['fra', 'eng']).
    
    Returns:
    - None
    
    Notes:
    - This function reads text from the image using Tesseract and writes the extracted text
      to the specified output file in 'utf-8' encoding.
    """
    
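    # Path to the Tesseract executable (a typical Windows install location).
    # Adjust it for your system, or remove this line if `tesseract` is on the PATH.
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'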
    
    img = Image.open(image_path)
    
    extracted_text = pytesseract.image_to_string(img, lang='+'.join(languages))
    
    with open(output_path, "a", encoding="utf-8") as file:
        file.write(extracted_text)



In [65]:
import cv2
import pytesseract
from PIL import Image

def process_noisy_image_tesseract(input_image_path, output_text_path, languages=['eng'], blur_kernel=(5, 5)):
    """
    Processes a noisy image by applying Gaussian blur and performs OCR to extract text,
    saving the result to a text file using Tesseract OCR.

    Args:
    - input_image_path (str): Path to the input image file.
    - output_text_path (str): Path to the output text file where OCR results will be saved.
    - languages (list of str, optional): List of languages for OCR processing (default is ['eng']).
    - blur_kernel (tuple, optional): Kernel size for Gaussian blur (default is (5, 5)).

    Returns:
    - None

    Notes:
    - The function reads the image, applies Gaussian blur to reduce noise, and performs OCR using Tesseract.
    - The extracted text is saved in UTF-8 encoding to the specified output file.
    """
    
    # Read the image in grayscale
    img = cv2.imread(input_image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Image not found at path: {input_image_path}")
    
    # Apply Gaussian blur to reduce noise
    blurred_img = cv2.GaussianBlur(img, blur_kernel, 0)
    
    # Wrap the grayscale NumPy array in a Pillow image (single channel, so no BGR-to-RGB conversion is needed)
    pil_img = Image.fromarray(blurred_img)
    
    # Perform OCR using Tesseract
    extracted_text = pytesseract.image_to_string(pil_img, lang='+'.join(languages))
    
    # Save the extracted text to a file
    with open(output_text_path, "w", encoding="utf-8") as file:
        file.write(extracted_text)


In [68]:
import cv2
import pytesseract
from PIL import Image

def preprocess_and_process_image_tesseract(input_image_path, output_text_path):
    """
    Preprocesses an image of handwritten text and extracts the text using Tesseract OCR.

    Args:
    - input_image_path (str): Path to the input image.
    - output_text_path (str): Path to save the transcribed text.

    Returns:
    - None

    Steps:
    1. Load the image and preprocess it for handwritten text.
    2. Apply Tesseract OCR to extract text from the processed image.
    3. Save the extracted text to the specified file.
    """
    
    img = cv2.imread(input_image_path, 0)
    if img is None:
        raise FileNotFoundError(f"Image not found at path: {input_image_path}")
    
    binary_img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                                       cv2.THRESH_BINARY, 11, 2)
    
    denoised_img = cv2.medianBlur(binary_img, 3)
    
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    processed_img = cv2.morphologyEx(denoised_img, cv2.MORPH_CLOSE, kernel)

    pil_img = Image.fromarray(processed_img)
    
    extracted_text = pytesseract.image_to_string(pil_img, lang='eng+fra')
    
    with open(output_text_path, "w", encoding="utf-8") as output_file:
        output_file.write(extracted_text)


In [61]:
test_OCR_1 = "../OCR_Items/test_OCR_1.jpg"
test_OCR_2 = "../OCR_Items/test_OCR_2.jpg"
test_OCR_3 = "../OCR_Items/test_image.jpeg"

test_OCR_trad_1 = "../OCR_Items/test_OCR_trad_1.txt"
test_OCR_trad_2 = "../OCR_Items/test_OCR_trad_2.txt"
test_OCR_trad_3 = "../OCR_Items/test_image.txt"

In [66]:
path_naive_approach_test_ocr_1 = "./Tesseract_Result/naive_test_ocr_1.txt"
path_naive_approach_test_ocr_2 = "./Tesseract_Result/naive_test_ocr_2.txt"
path_NHW_naive_approach_test_ocr = "./Tesseract_Result/NHW_naive_test_ocr.txt"

process_ocr_image_naive_tesseract(test_OCR_1,path_naive_approach_test_ocr_1)
process_ocr_image_naive_tesseract(test_OCR_2,path_naive_approach_test_ocr_2)
process_ocr_image_naive_tesseract(test_OCR_3,path_NHW_naive_approach_test_ocr)

with open(path_naive_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	test_1 = fichier.read()

with open(path_naive_approach_test_ocr_2, "r",encoding="utf-8") as fichier:
	test_2 = fichier.read()

with open(path_NHW_naive_approach_test_ocr, "r",encoding="utf-8") as fichier:
	test_3 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,test_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,test_2)} / 100")
print(f"The similarity rate for text 3 is : {compare_texts(traduction_3,test_3)} / 100")


The similarity rate for text 1 is : 0.0 / 100
The similarity rate for text 2 is : 0.0 / 100
The similarity rate for text 3 is : 95.96 / 100


In [67]:
path_GB_approach_test_ocr_1 = "./Tesseract_Result/GB_test_ocr_1.txt"
path_GB_approach_test_ocr_2 = "./Tesseract_Result/GB_test_ocr_2.txt"
path_NHW_GB_approach_test_ocr = "./Tesseract_Result/NHW_GB_test_ocr.txt"

process_noisy_image_tesseract(test_OCR_1,path_GB_approach_test_ocr_1)
process_noisy_image_tesseract(test_OCR_2,path_GB_approach_test_ocr_2)
process_noisy_image_tesseract(test_OCR_3,path_NHW_GB_approach_test_ocr)

with open(path_GB_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	test_1 = fichier.read()

with open(path_GB_approach_test_ocr_2, "r",encoding="utf-8") as fichier:
	test_2 = fichier.read()

with open(path_NHW_GB_approach_test_ocr, "r",encoding="utf-8") as fichier:
	test_3 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,test_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,test_2)} / 100")
print(f"The similarity rate for text 3 is : {compare_texts(traduction_3,test_3)} / 100")


The similarity rate for text 1 is : 0.0 / 100
The similarity rate for text 2 is : 0.0 / 100
The similarity rate for text 3 is : 95.96 / 100


In [69]:
path_PPI_approach_test_ocr_1 = "./Tesseract_Result/PPI_test_ocr_1.txt"
path_PPI_approach_test_ocr_2 = "./Tesseract_Result/PPI_test_ocr_2.txt"
path_NHW_PPI_approach_test_ocr = "./Tesseract_Result/NHW_PPI_test_ocr.txt"

preprocess_and_process_image_tesseract(test_OCR_1,path_PPI_approach_test_ocr_1)
preprocess_and_process_image_tesseract(test_OCR_2,path_PPI_approach_test_ocr_2)
preprocess_and_process_image_tesseract(test_OCR_3,path_NHW_PPI_approach_test_ocr)

with open(path_PPI_approach_test_ocr_1, "r",encoding="utf-8") as fichier:
	test_1 = fichier.read()

with open(path_PPI_approach_test_ocr_2, "r",encoding="utf-8") as fichier:
	test_2 = fichier.read()

with open(path_NHW_PPI_approach_test_ocr, "r",encoding="utf-8") as fichier:
	test_3 = fichier.read()

print(f"The similarity rate for text 1 is : {compare_texts(traduction_1,test_1)} / 100")
print(f"The similarity rate for text 2 is : {compare_texts(traduction_2,test_2)} / 100")
print(f"The similarity rate for text 3 is : {compare_texts(traduction_3,test_3)} / 100")


The similarity rate for text 1 is : 0.0 / 100
The similarity rate for text 2 is : 0.0 / 100
The similarity rate for text 3 is : 95.96 / 100
