<div class="text_cell_render border-box-sizing rendered_html">
<div style="color:black; border: 2px solid #6f42c1; background-color:#f3e8ff; padding: 20px; border-radius: 15px; font-size: 200%; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; text-align:center; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);">📚Extract Text From Images📚</div>
</div>


<div style="text-align:center">
    <img src="https://lh6.googleusercontent.com/fiUeo1zFk4n3RkXzd3F1X1NjHFvFsITKtAkJ6cuggNVgGQQ0sxHR3DNEhQtdlxOtGiHZ9C52MVKHVC9CzTU77DN5gjCi_jpHpu_cR6ZZVf3PZFmLZ-K9icpEWpjZWM1yT5eXncfF6C150fQoH1jmDfQ" alt="Image">
</div>

The workflow consists of the following steps, described here before the code:

### 1. Set Up the Environment
- Install necessary libraries:
  - OpenCV
  - pytesseract

### 2. Import Required Libraries
- Import the necessary Python libraries for image processing and OCR:
  - `cv2` for OpenCV
  - `pytesseract` for OCR
  - `os` for directory and file operations

### 3. Define the Image Directory
- Specify the directory where your images are stored.

### 4. Define the Preprocessing Function
- Create a function to preprocess images:
  - Convert image to grayscale.
  - Apply Gaussian blur to reduce noise.
  - Apply adaptive thresholding to binarize the image.

### 5. Define the OCR Function
- Create a function to extract text from images using OCR:
  - Configure Tesseract with page segmentation mode 6 (treat the image as a single uniform block of text), which suits tabular data better.

### 6. Iterate Over Images and Process Them
- Loop through each image in the directory:
  - Check if the file is an image (e.g., `.jpg` or `.png`).
  - Read the image.
  - Preprocess the image using the preprocessing function.
  - Extract text from the preprocessed image using the OCR function.
  - Save the extracted text to a separate file named after the image file.
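
The file-selection and naming part of step 6 can be sketched on its own, without reading any image. This is a minimal illustration using only the standard library; the helper names `is_image_file` and `output_name_for` are hypothetical, and the case-insensitive extension match is an assumption that goes slightly beyond the `.jpg`/`.png` check described above.

```python
from pathlib import Path

# Extensions treated as images; the steps above mention .jpg and .png.
IMAGE_EXTENSIONS = {".jpg", ".png"}


def is_image_file(filename: str) -> bool:
    """Return True if the file name ends with an accepted image extension."""
    # Assumption: match case-insensitively so 'SCAN.JPG' is accepted too.
    return Path(filename).suffix.lower() in IMAGE_EXTENSIONS


def output_name_for(filename: str) -> str:
    """Derive the text-file name from the image name (a.jpg becomes a.txt)."""
    return Path(filename).with_suffix(".txt").name


# Two names from the dataset listing plus a made-up non-image file.
names = ["image00004.jpg", "tm224644d1_ex99-1img028.jpg", "notes.md"]
planned = {n: output_name_for(n) for n in names if is_image_file(n)}
print(planned)
```

Only the last extension is replaced, so a name containing other dots or hyphens keeps the rest of its stem intact.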

<div class="text_cell_render border-box-sizing rendered_html">
<div style="color:black; border: 2px solid #6f42c1; background-color:#f3e8ff; padding: 20px; border-radius: 15px; font-size: 200%; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; text-align:center; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);"> 📂IMPORTING LIBRARIES📂 </div>
</div>

In [1]:
import numpy as np 
import pandas as pd 
import cv2
import pytesseract
import os
import re
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/tm224644d1_ex99-1img015.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00045.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00004.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/tm224644d1_ex99-1img005.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00013.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00003.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00046.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/image00047.jpg
/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR/tm224644d1_ex99-1img028.jpg


<div class="text_cell_render border-box-sizing rendered_html">
<div class="alert alert-block alert-success" style="margin: 20px; padding: 20px; border-radius: 10px; border: 2px solid #4CAF50; background-color: #E6F7E2;">
    <b>📂 Libraries:</b> Successfully imported the required libraries
</div>
</div>

In [2]:
# Path to the directory containing images
image_dir = '/kaggle/input/image-dataset-for-ocr/[Business Quant] Image dataset for OCR'

<div class="text_cell_render border-box-sizing rendered_html">
<div style="color:black; border: 2px solid #6f42c1; background-color:#f3e8ff; padding: 20px; border-radius: 15px; font-size: 200%; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; text-align:center; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);"> 🛠Preprocessing Image🛠 </div>
</div>

In [3]:
# Function to preprocess images
def preprocess_image(image):
    # Convert image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Apply Gaussian blur to reduce noise
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)

    # Apply adaptive thresholding to binarize the image
    thresholded = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)

    return thresholded
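
To make the thresholding step more concrete, here is a simplified pure-Python sketch of the idea behind adaptive thresholding: each pixel is compared against the average of its own neighbourhood minus a constant, rather than against one global cutoff. It is not a reimplementation of the function above. It uses a plain (unweighted) mean instead of the Gaussian-weighted one selected by `cv2.ADAPTIVE_THRESH_GAUSSIAN_C`, a 3x3 block instead of 11, and clamped indices at the borders, all of which are assumptions chosen to keep the example small. The name `adaptive_mean_threshold` is hypothetical.

```python
def adaptive_mean_threshold(pixels, block_size=3, c=2, max_value=255):
    """Binarize a 2D list of grayscale values using a local-mean threshold."""
    height, width = len(pixels), len(pixels[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Sum the neighbourhood, clamping indices at the image borders
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    ny = min(max(y + dy, 0), height - 1)
                    nx = min(max(x + dx, 0), width - 1)
                    total += pixels[ny][nx]
            # Local threshold: neighbourhood mean minus the constant c
            threshold = total / (block_size * block_size) - c
            row.append(max_value if pixels[y][x] > threshold else 0)
        result.append(row)
    return result


# A bright patch with one dark pixel in the centre (made-up values)
patch = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = adaptive_mean_threshold(patch)
for line in binary:
    print(line)
```

The dark centre pixel falls below its local threshold and becomes 0, while the surrounding bright pixels stay at 255. Because the threshold is computed per pixel, the same logic copes with uneven lighting across a scanned page.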

<div class="text_cell_render border-box-sizing rendered_html">
<div style="color:black; border: 2px solid #6f42c1; background-color:#f3e8ff; padding: 20px; border-radius: 15px; font-size: 200%; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; text-align:center; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);"> 🔍Extract Text🔍 </div>
</div>

In [4]:
# Function to extract text from images using OCR
def extract_text(image):
    # --oem 3 selects Tesseract's default OCR engine mode; --psm 6 treats the
    # image as a single uniform block of text, which suits tabular layouts
    custom_config = r'--oem 3 --psm 6'
    text = pytesseract.image_to_string(image, config=custom_config)
    return text
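
The configuration string is the main knob worth experimenting with when the output is poor. The sketch below builds that string from validated values so that different page segmentation modes can be tried without hand-editing it. The helper name `build_tesseract_config` is hypothetical; the value ranges (0 to 13 for `--psm`, 0 to 3 for `--oem`) and the short mode descriptions follow Tesseract's command-line help, and only a few modes are listed.

```python
# A few page segmentation modes, paraphrased from Tesseract's help text.
PSM_NOTES = {
    3: "fully automatic page segmentation, no orientation detection (default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    11: "sparse text, in no particular order",
}


def build_tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Return a config string such as '--oem 3 --psm 6' after range checks."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


# Print the candidate configurations that could be passed as `config=`
for mode, note in PSM_NOTES.items():
    print(build_tesseract_config(psm=mode), "->", note)
```

The returned string could be passed as the `config` argument of `pytesseract.image_to_string` in place of `custom_config`; which mode works best depends on the layout of each image.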

In [5]:
# Iterate over images in the directory
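for filename in os.listdir(image_dir):
    # Accept .jpg and .png files regardless of letter case
    if filename.lower().endswith(('.jpg', '.png')):
        # Read the image
        image_path = os.path.join(image_dir, filename)
        image = cv2.imread(image_path)

        # cv2.imread returns None if the file cannot be decoded; skip it
        if image is None:
            print(f"Skipping unreadable image: {filename}")
            continue

        # Preprocess the image
        preprocessed_image = preprocess_image(image)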

        # Extract text from the preprocessed image
        extracted_text = extract_text(preprocessed_image)

        # Save the extracted text to a separate file
        output_file = f"{os.path.splitext(filename)[0]}.txt"
        with open(output_file, 'w') as f:
            f.write(extracted_text)

print("Text extraction completed. Extracted text saved to individual files.")

Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file


Text extraction completed. Extracted text saved to individual files.
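
The three `Premature end of JPEG file` lines are decoder warnings rather than errors: the loop still finished, so the affected images were read, most likely with part of their content missing. A rough way to find which files are affected is to check whether each one ends with the JPEG end-of-image marker. The sketch below is a heuristic under that assumption, not a full validation; the function name `jpeg_looks_complete` is hypothetical, the demo files are fake byte strings rather than real images, and a valid JPEG with extra trailing bytes would be flagged incorrectly.

```python
import os
import tempfile

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def jpeg_looks_complete(path: str) -> bool:
    """Heuristic: True if the file starts with SOI and ends with EOI."""
    with open(path, "rb") as f:
        data = f.read()
    return data.startswith(JPEG_SOI) and data.endswith(JPEG_EOI)


# Demo on two fake files: only the markers matter for this check.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.jpg")
    truncated = os.path.join(tmp, "truncated.jpg")
    with open(complete, "wb") as f:
        f.write(JPEG_SOI + b"fake-payload" + JPEG_EOI)
    with open(truncated, "wb") as f:
        f.write(JPEG_SOI + b"fake-pay")  # cut off before the EOI marker
    results = {
        "complete.jpg": jpeg_looks_complete(complete),
        "truncated.jpg": jpeg_looks_complete(truncated),
    }
print(results)
```

Running the check over `image_dir` before OCR would show which outputs deserve extra scrutiny, since text from a truncated image may be incomplete.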


<div class="text_cell_render border-box-sizing rendered_html">
<div class="alert alert-block alert-success" style="margin: 20px; padding: 20px; border-radius: 10px; border: 2px solid #4CAF50; background-color: #E6F7E2;">
    <b>📂 Extraction:</b> The text was extracted and saved successfully
</div>
</div>

<div class="text_cell_render border-box-sizing rendered_html">
<div style="color:black; border: 2px solid #6f42c1; background-color:#f3e8ff; padding: 20px; border-radius: 15px; font-size: 200%; font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; text-align:center; text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.7);"> 📜Sample Of Saved File📜 </div>
</div>

In [6]:
output_file = '/kaggle/working/image00004.txt'

# Open and read the extracted text file
with open(output_file, 'r') as f:
    extracted_text = f.read()

# Print the extracted text
print(extracted_text)

Interim condensed consolidated statement of cash flows
ne
on
meecmsies
_ i
; were
SSS Sn
TE
earn a 2 2
Re ————"__>>=>====SSananq—=aaa="11
een : 5 m
EE  —_—————————_—_—_—_—_—_——=—_—===_S====———S=——aA
panibyeenrtray = is =
a ————_—[—$—[—[_—[_—[_—[_—]_—]_—]_—]_—>——=—————————>>>>>=>>>>=>=E=EC=h"“~EULD2a=S=S>=anm=_hE—aI
ono oo ,
a
= = :
eR ccc ny
Sa = 7 >
| ~( eneueltes== << —— et
Soon 7 ™ fs
OEE
cet 5 2 ~
8S
a ze = =
ed
[ene pewe leer Ta a = 7
RP R———————_—_—_[_>>——[—[—[—[—[—[—>_—[_—"_——>=>>>>>>>>>>>)—SSSSS——S==a
Romine 7 a on
OF a OF ee IN)
oss : et ma
oO —— OO
Saas = = =
Sa
PEE OE LM eT,
Root
i
ee a = =H
0  —————————————————_—_—_—_—— = = =
a A 8 85 0 $< — sO
prapesusalgaletpsivann am in 7
Det hey gt
Com a Sank mre pe — So



In [7]:
output_file = '/kaggle/working/tm224644d1_ex99-1img028.txt'

# Open and read the extracted text file
with open(output_file, 'r') as f:
    extracted_text = f.read()

# Print the extracted text
print(extracted_text)

ISAPYSAp quarterly Statement 4 2021 R ® & ® B
Services
1-04 2022 1-04 bin ane
€ milGons, unless otherwise stated —$—$ rrr
(noo1FRS) Actual Constant Actual, Actual, Constant
Curenty Currency Currency Currency Currency
Cloud and software | ° ol 5 95 -95
Servkes | 3.234 3,282] 3,374 4 3
Total segment revenue { 3,234 3,283] 3379 4 3
Cost of coud { -78 -80] -74 6 8
Cost of software licenses and support ( -18 -19] 32 43 42
Cost of cloud and sofware ( 97 99] 106 “9 7
Cost of services. | -2035 -2,062| __-2,209 -8 7
Total cost of revenue ( +2432 +2,160| -2,315 <8 7
Segment gross profit I 1,103 1,122] 1,063 4 6
Other segment expenses [ 375 -379] 418 -10 9
Segment profit (loss) | 728 744) 645 13 15
Margins.
Services gross margin (in 9) | 372 372] uS 269 2700
Segment gross margin (in %) ( 342 34.2] 225 26pp 27%pp
Segment margin (in 9) | 25 22.74 19.2 34pp 3.6pp
Due to rounding, numbers may not add up precisely.
28/35
es



In [8]:
output_file = '/kaggle/working/tm224644d1_ex99-1img005.txt'

# Open and read the extracted text file
with open(output_file, 'r') as f:
    extracted_text = f.read()

# Print the extracted text
print(extracted_text)

ISAPY Sap quarterly statement 4 2021 R ® & ® B
AV ‘ A
asl Financial Results at a Glance
| Fourth Quarter 2023 eee
IFRS Non4Frs*
i
€ million, unless otherwise stated Q42021 042020 Bi% 942021 042020 ain const
Current cloud backlog? | Nal NA Nal 9,447] 7,155 32 26
‘Thereof SAP SJAHANA Current Cloud Backlog? | Nal NA NA| 1,707 927 84 16
Cloud revenue | 2611[ 2042 23| 2611] 2044 28 24
Thereot SAP S/4HANA Cloud revenue | 329| 199 65| 329| 199 6s 61
Software licenses and support revenue | 43794538 | 4379] 4538 4 4
Cloud and software revenue | 6990] 6.579 6| 6990] 6582 6 3
Total revenue | 7981| 7,538 6| 798] 7,542 6 3
Share of more predictable revenue (in 96) ( 69| 65 Spo | 69] 6s Spo
Operating prof (loss) | 2466] 2,657 —s| 2468] 2,772 <1. -12
Profit (loss) after tax | 2487[ 2.934 -25| 2280] 2,026 13
Basic eamings per share {in €) | 1.26| 162 -23 | 1.86] 1.70 10
Number of employees (FTE, December 31) | t07415| 102,430 s| na | NA NA NA
1 For a breakdown of the individual adfustments see table “