# Extracting text from all table images and saving it as clean CSV

This notebook reads every PNG image in the `Contracts/table_imgs/` folder, splits each image into table cells, extracts the text with OCR, removes newline characters (`\n`) from the text, and saves the result as a table to `Contracts/table_csv/{image_name}_table.csv`. The code is organized as one function per step, and each step catches and reports its own errors.

## Goal
- Read every PNG image in the `Contracts/table_imgs/` folder.
- Detect the horizontal and vertical lines in each image.
- Split the table into cells and extract the text of each cell with OCR.
- Remove `\n` characters from the OCR output.
- Save the result as a table to `Contracts/table_csv/{image_name}_table.csv`.

## Requirements
- A **Python 3.9+** environment.
- Required packages:
  - `opencv-python`: `pip install opencv-python`
  - `numpy`: `pip install numpy`
  - `matplotlib`: `pip install matplotlib`
  - `pytesseract`: `pip install pytesseract`
  - `requests`: `pip install requests`
  - `pandas`: `pip install pandas`
- **Tesseract OCR**: must be installed (for example, on Windows: `D:\My PC Folder\dasturlar\teseract\tesseract.exe`), with the `eng` and `rus` language data that the code requests via `-l eng+rus`.
- **OCR.space API key**: (default: `K84902735288957`).
- **Folder layout**: the images must be in the `Contracts/table_imgs/` folder. Results are saved to the `Contracts/table_csv/` folder.
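
Before running the notebook, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the original notebook: `check_modules` and `REQUIRED_MODULES` are hypothetical names introduced here, and the check only tests importability, not versions.

```python
import importlib.util

# Import names, which differ from some pip names (opencv-python is imported as cv2).
REQUIRED_MODULES = ["cv2", "numpy", "matplotlib", "pytesseract", "requests", "pandas"]


def check_modules(module_names):
    """Return {module_name: True/False} depending on whether it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any module reported as missing can be installed with the corresponding `pip install` command from the list.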

## Overall workflow
1. Import all required modules.
2. List every PNG image in the `Contracts/table_imgs/` folder.
3. For each image:
   - Convert it to grayscale.
   - Detect the horizontal and vertical lines.
   - Split the table into cells.
   - Extract the text of each cell with OCR and remove `\n` characters.
   - Save the result to `Contracts/table_csv/{image_name}_table.csv`.


## Step 1: Import the required packages

This cell imports all required modules. It assumes the packages are already installed.


In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt
import itertools
import requests
import pytesseract
import pandas as pd
import logging
from functools import lru_cache
import re
import glob
from pathlib import Path
import os

# Set the path to the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'D:\My PC Folder\dasturlar\teseract\tesseract.exe'

# Logging configuration
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
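
The Tesseract path above is specific to one Windows machine. As a hedged alternative, the path could be resolved at run time. This is only a sketch: `resolve_tesseract_cmd` and the `TESSERACT_CMD` environment variable are hypothetical names, not something pytesseract reads on its own.

```python
import os
import shutil


def resolve_tesseract_cmd(env_var="TESSERACT_CMD"):
    """Return a Tesseract path from the environment variable, else from PATH, else None.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    return os.environ.get(env_var) or shutil.which("tesseract")


if __name__ == "__main__":
    print("Tesseract executable:", resolve_tesseract_cmd())
```

The result could then be assigned to `pytesseract.pytesseract.tesseract_cmd` when it is not `None`, keeping the hard-coded path only as a last resort.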


## Step 2: List the image files

Finds every PNG image in the `Contracts/table_imgs/` folder and returns the list of paths.


In [None]:
def list_image_files(folder_path):
    """Return the paths of all PNG images in the given folder."""
    try:
        image_files = glob.glob(os.path.join(folder_path, "*.png"))
        if not image_files:
            print(f"No PNG files found in {folder_path}.")
        return image_files
    except Exception as e:
        print(f"Error: could not read the folder: {e}")
        return []
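
`glob.glob` returns paths in an arbitrary order, and whether `*.png` also matches `.PNG` depends on the platform. A minimal sketch of a `pathlib` variant that sorts the result and ignores the extension's case is shown here. `list_png_files` is a hypothetical name, and this is an alternative to `list_image_files`, not the notebook's own function.

```python
from pathlib import Path


def list_png_files(folder_path):
    """Return a sorted list of PNG paths in folder_path (extension case ignored)."""
    folder = Path(folder_path)
    if not folder.is_dir():
        # A missing folder is treated like an empty one.
        return []
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".png"
    )


if __name__ == "__main__":
    for path in list_png_files("Contracts/table_imgs/"):
        print(path.name)
```

A stable order makes repeated runs process the images in the same sequence, which simplifies comparing logs.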


## Step 3: Read the image and convert it to grayscale

Each image is read and converted to grayscale. A helper function is provided for displaying images.


In [None]:
def visualize_image(image, cmap='gray'):
    """Helper function that displays an image."""
    plt.figure(figsize=(15, 10))
    plt.imshow(image, cmap=cmap)
    plt.axis('off')
    plt.show()

def read_and_convert_to_gray(image_path):
    """Read an image and convert it to grayscale."""
    try:
        img = cv2.imread(image_path)
        if img is None:
            raise FileNotFoundError(f"File {image_path} was not found or could not be decoded.")
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        print(f"Image loaded: {image_path}")
        visualize_image(gray)
        return gray
    except Exception as e:
        print(f"Error: could not read the image: {e}")
        return None


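## Step 4: Detect the horizontal and vertical lines

The probabilistic Hough transform, applied to Canny edges, finds line segments in the table image. The near-horizontal and near-vertical segments are drawn onto a mask, and the rows and columns where the mask is densest give the positions of the table lines.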


In [None]:
def make_hv_mask(gray, canny_t1=30, canny_t2=150, hough_thresh=50, min_len=100, max_gap=20, angle_tol=5, thickness=2):
    """Build a mask that contains only the near-horizontal and near-vertical lines."""
    try:
        edges = cv2.Canny(gray, canny_t1, canny_t2)
        raw = cv2.HoughLinesP(edges, 1, np.pi/180, hough_thresh, minLineLength=min_len, maxLineGap=max_gap)
        mask = np.zeros_like(edges)
        if raw is not None:
            for x1, y1, x2, y2 in raw[:, 0]:
                # Fold the angle into [0, 90] so the endpoint order does not matter
                ang = np.degrees(np.arctan2(abs(y2-y1), abs(x2-x1)))
                if ang < angle_tol or ang > 90 - angle_tol:
                    cv2.line(mask, (int(x1), int(y1)), (int(x2), int(y2)), 255, thickness)
        return mask
    except Exception as e:
        print(f"Error: could not build the mask: {e}")
        return np.zeros_like(gray)
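
The angle test inside `make_hv_mask` can be examined in isolation. The sketch below uses only the standard library; `classify_segment` is a hypothetical helper introduced for illustration. It folds the angle into the range 0 to 90 degrees, so a segment is classified the same way regardless of which endpoint comes first.

```python
import math


def classify_segment(x1, y1, x2, y2, angle_tol=5.0):
    """Return 'horizontal', 'vertical', or None for a line segment.

    The angle is measured in degrees and folded into [0, 90].
    """
    angle = math.degrees(math.atan2(abs(y2 - y1), abs(x2 - x1)))
    if angle < angle_tol:
        return "horizontal"
    if angle > 90.0 - angle_tol:
        return "vertical"
    return None


if __name__ == "__main__":
    for seg in [(0, 0, 100, 2), (100, 0, 0, 0), (5, 0, 5, 80), (0, 0, 50, 50)]:
        print(seg, classify_segment(*seg))
```

Segments that fall outside both tolerance bands, such as diagonal strokes in text, are ignored and never reach the mask.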

def find_peak_lines(gray, density_frac=0.8):
    """Locate the table lines and merge adjacent rows/columns into single lines."""
    try:
        mask = make_hv_mask(gray)
        h, w = mask.shape

        # Compute the row and column densities of the mask
        row_density = mask.sum(axis=1)
        col_density = mask.sum(axis=0)
        r_thresh = row_density.max() * density_frac
        c_thresh = col_density.max() * density_frac

        # Horizontal lines: group consecutive dense rows, keep the middle row of each group
        rows_over = np.where(row_density >= r_thresh)[0]
        horiz_lines = []
        for _, group in itertools.groupby(enumerate(rows_over), key=lambda iv: iv[0] - iv[1]):
            seg = [r for _, r in group]
            y_mid = seg[len(seg)//2]
            horiz_lines.append((0, y_mid, w-1, y_mid))

        # Vertical lines: group consecutive dense columns, keep the middle column of each group
        cols_over = np.where(col_density >= c_thresh)[0]
        vert_lines = []
        for _, group in itertools.groupby(enumerate(cols_over), key=lambda iv: iv[0] - iv[1]):
            seg = [c for _, c in group]
            x_mid = seg[len(seg)//2]
            vert_lines.append((x_mid, 0, x_mid, h-1))

        return horiz_lines, vert_lines
    except Exception as e:
        print(f"Error: could not detect the lines: {e}")
        return [], []
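
The `groupby(enumerate(...))` idiom in `find_peak_lines` deserves a closer look: for a run of consecutive integers, the difference between position and value is constant, so it works as a grouping key. The following standalone sketch shows the idiom on a plain list; `group_runs` and `run_midpoints` are hypothetical helper names.

```python
from itertools import groupby


def group_runs(indices):
    """Split a sorted list of integers into runs of consecutive values."""
    runs = []
    # position - value stays constant inside a run of consecutive integers
    for _, group in groupby(enumerate(indices), key=lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in group])
    return runs


def run_midpoints(indices):
    """Return the middle element of each run, as find_peak_lines does."""
    return [run[len(run) // 2] for run in group_runs(indices)]


if __name__ == "__main__":
    dense_rows = [3, 4, 5, 10, 11, 20]
    print(group_runs(dense_rows))
    print(run_midpoints(dense_rows))
```

Each run corresponds to one drawn line that is several pixels thick, and the midpoint collapses it to a single coordinate.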

def visualize_lines(gray, horiz_lines, vert_lines):
    """Draw the detected lines on top of the image."""
    try:
        vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)
        for x1, y1, x2, y2 in horiz_lines:
            cv2.line(vis, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        for x1, y1, x2, y2 in vert_lines:
            cv2.line(vis, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        plt.figure(figsize=(15, 10))
        # vis is already in RGB order, so it can be shown directly
        plt.imshow(vis)
        plt.axis('off')
        plt.show()
        print(f"Horizontal lines: {len(horiz_lines)}")
        print(f"Vertical lines: {len(vert_lines)}")
    except Exception as e:
        print(f"Error: could not visualize the lines: {e}")


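## Step 5: Split the table into cells

The detected lines are used to cut the table into cells: each pair of neighboring horizontal lines bounds a row, and each pair of neighboring vertical lines bounds a column.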


In [None]:
def cut_cells(gray, horiz_lines, vert_lines):
    """Cut the table into a list of rows, each a list of cell images."""
    try:
        # Unique, sorted y positions of horizontal lines and x positions of vertical lines
        h_lines = sorted(set(int(y1) for _, y1, _, _ in horiz_lines))
        v_lines = sorted(set(int(x1) for x1, _, _, _ in vert_lines))
        cells = []

        for i in range(len(h_lines) - 1):
            row_cells = []
            for j in range(len(v_lines) - 1):
                y1, y2 = h_lines[i], h_lines[i+1]
                x1, x2 = v_lines[j], v_lines[j+1]
                cell = gray[y1:y2, x1:x2]
                row_cells.append(cell)
            cells.append(row_cells)

        return cells
    except Exception as e:
        print(f"Error: could not cut the table into cells: {e}")
        return []
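
If a thick or slightly skewed line is detected twice, `cut_cells` produces slivers only a few pixels wide. One possible remedy, shown here as a sketch and not as part of the notebook, is to drop line coordinates that lie too close to the previous one before forming the cell boxes. `merge_close`, `cell_boxes`, and the `min_gap` default are hypothetical.

```python
def merge_close(coords, min_gap=5):
    """Keep sorted unique coordinates that are at least min_gap apart."""
    kept = []
    for c in sorted(set(coords)):
        if not kept or c - kept[-1] >= min_gap:
            kept.append(c)
    return kept


def cell_boxes(h_coords, v_coords, min_gap=5):
    """Return rows of (y1, y2, x1, x2) boxes between neighboring lines."""
    ys = merge_close(h_coords, min_gap)
    xs = merge_close(v_coords, min_gap)
    return [
        [(y1, y2, x1, x2) for x1, x2 in zip(xs, xs[1:])]
        for y1, y2 in zip(ys, ys[1:])
    ]


if __name__ == "__main__":
    # The lines at y=50 and y=52 are treated as one line.
    for row in cell_boxes([0, 50, 52, 100], [0, 40, 80]):
        print(row)
```

Each box can then be applied to the grayscale image as `gray[y1:y2, x1:x2]`, which is the same slicing `cut_cells` performs.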


## Step 6: Visualize the cells

The cut-out cells are displayed as a grid of images.


In [None]:
def visualize_cells(cells):
    """Display the cut-out cells in a grid."""
    try:
        n_rows = len(cells)
        n_cols = len(cells[0]) if n_rows > 0 else 0
        if n_rows == 0 or n_cols == 0:
            print("No cells to display.")
            return
        # squeeze=False keeps axes two-dimensional even for a single row or column
        fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols*2, n_rows*2), squeeze=False)
        for i in range(n_rows):
            for j in range(n_cols):
                ax = axes[i, j]
                # Cells are single-channel grayscale, so no color conversion is needed
                ax.imshow(cells[i][j], cmap='gray')
                ax.axis('off')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"Error: could not visualize the cells: {e}")


## Step 7: OCR processing and text cleanup

Text is extracted from each cell with the OCR.space API, with Tesseract as the fallback, and `\n` characters are removed. The `OCRProcessor` class first uses Tesseract to detect whether the cell's script is mostly Cyrillic or Latin, and passes the matching language to the OCR engine.


In [None]:
class OCRProcessor:
    """Wraps the OCR.space API and Tesseract behind a single interface."""

    def __init__(self, api_key: str = 'K84902735288957'):
        self.api_key = api_key
        self.session = requests.Session()
        self.kirill_pattern = re.compile(r'[а-яё]', re.IGNORECASE)
        self.lotin_pattern = re.compile(r'[a-z]', re.IGNORECASE)

    @lru_cache(maxsize=128)
    def detect_script(self, text: str) -> str:
        """Return 'rus' if Cyrillic letters outnumber Latin letters, otherwise 'eng'."""
        if not text or not text.strip():
            return 'rus'
        kirill_matches = len(self.kirill_pattern.findall(text))
        lotin_matches = len(self.lotin_pattern.findall(text))
        return 'rus' if kirill_matches > lotin_matches else 'eng'

    def preprocess_image(self, image):
        """Prepare the image for OCR: contrast enhancement, denoising, sharpening."""
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(gray)
        denoised = cv2.fastNlMeansDenoising(enhanced)
        kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
        sharpened = cv2.filter2D(denoised, -1, kernel)
        return sharpened

    def tesseract_ocr(self, image, config: str = '--psm 6') -> str:
        """Extract text with Tesseract OCR."""
        try:
            processed_image = self.preprocess_image(image)
            text = pytesseract.image_to_string(processed_image, config=config)
            return text.strip()
        except Exception as e:
            logger.error(f"Tesseract OCR error: {e}")
            return ""

    def ocr_space_api(self, image, language: str) -> str:
        """Extract text with the OCR.space API. Returns "" on any failure."""
        try:
            is_success, buffer = cv2.imencode(".png", image, [cv2.IMWRITE_PNG_COMPRESSION, 6])
            if not is_success:
                raise ValueError("Could not encode the image as PNG.")
            files = {'image.png': buffer.tobytes()}
            data = {
                'apikey': self.api_key,
                'language': language,
                'isOverlayRequired': False,
                'scale': True,
                'OCREngine': 2
            }
            response = self.session.post('https://api.ocr.space/parse/image', files=files, data=data, timeout=30)
            # Surface HTTP-level failures before trying to parse the body
            response.raise_for_status()
            result = response.json()
            if result.get('IsErroredOnProcessing', False):
                raise Exception(result.get('ErrorMessage', 'Unknown OCR error'))
            return result['ParsedResults'][0]['ParsedText'].strip()
        except Exception as e:
            logger.error(f"OCR.space API error: {e}")
            return ""

    def auto_ocr(self, image, fallback_to_tesseract: bool = True) -> tuple[str, str]:
        """Detect the script, run OCR, and strip newline characters from the text."""
        logger.info("Using Tesseract to detect the script...")
        text_tmp = self.tesseract_ocr(image, '--psm 6 -l eng+rus')
        detected_lang = self.detect_script(text_tmp)
        logger.info(f"Detected language: {detected_lang}")
        result = self.ocr_space_api(image, detected_lang)
        # ocr_space_api returns "" on failure instead of raising, so an empty
        # result (not an exception) is what triggers the Tesseract fallback.
        if not result and fallback_to_tesseract:
            logger.info("OCR.space returned no text; falling back to Tesseract...")
            result = self.tesseract_ocr(image, f'--psm 6 -l {detected_lang}')
        # Replace newline characters with spaces
        cleaned_result = result.replace('\n', ' ').replace('\r', ' ').strip()
        return cleaned_result, detected_lang

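# Module-level OCR function for external use
@lru_cache(maxsize=4)
def _get_processor(api_key: str) -> OCRProcessor:
    """Reuse one OCRProcessor per API key so its HTTP session and cache persist."""
    return OCRProcessor(api_key)

def auto_ocr(image, api_key: str = 'K84902735288957') -> tuple[str, str]:
    return _get_processor(api_key).auto_ocr(image)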
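
The script detection and newline cleanup do not depend on any OCR engine, so they can be tried on plain strings. The sketch below restates them as standalone functions. `normalize_cell_text` is a hypothetical helper that collapses every run of whitespace into one space, which differs slightly from the notebook's `replace` calls (those turn `\r\n` into two spaces).

```python
import re
from functools import lru_cache

CYRILLIC = re.compile(r"[а-яё]", re.IGNORECASE)
LATIN = re.compile(r"[a-z]", re.IGNORECASE)


def normalize_cell_text(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    return " ".join(text.split())


@lru_cache(maxsize=128)
def detect_script(text):
    """Return 'rus' if Cyrillic letters outnumber Latin ones, else 'eng'.

    Empty text defaults to 'rus', mirroring OCRProcessor.detect_script.
    """
    if not text.strip():
        return "rus"
    cyrillic = len(CYRILLIC.findall(text))
    latin = len(LATIN.findall(text))
    return "rus" if cyrillic > latin else "eng"


if __name__ == "__main__":
    print(repr(normalize_cell_text("  Total\n amount\r\n due ")))
    print(detect_script("Привет world"), detect_script("hello"))
```

Because the function is defined at module level, `lru_cache` keys only on the text, so repeated cell contents are classified once.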


## Step 8: Main workflow

This cell reads every image, cuts out the table cells, runs OCR on them, cleans the text, and saves the result to a CSV file.


In [None]:
def main():
    """Main workflow: read every image, run OCR, and save the result to a CSV file."""
    image_folder = "Contracts/table_imgs/"
    output_folder = "Contracts/table_csv/"

    # Create the output folder
    Path(output_folder).mkdir(parents=True, exist_ok=True)

    # List the image files
    image_files = list_image_files(image_folder)
    if not image_files:
        print("No PNG files found. Stopping.")
        return

    for image_path in image_files:
        print(f"\nProcessing image: {image_path}")

        # Read the image and convert it to grayscale
        gray = read_and_convert_to_gray(image_path)
        if gray is None:
            continue

        # Detect and visualize the lines
        horiz_lines, vert_lines = find_peak_lines(gray)
        visualize_lines(gray, horiz_lines, vert_lines)

        # Split the table into cells
        cells = cut_cells(gray, horiz_lines, vert_lines)
        if not cells:
            print(f"No cells found for {image_path}.")
            continue

        # Visualize the cells
        visualize_cells(cells)

        # Extract and clean the text with OCR
        try:
            table = []
            for row in cells:
                # Cells are already grayscale, so they go to OCR unconverted;
                # zero-size cells yield an empty string.
                table.append([auto_ocr(cell)[0] if cell.size else "" for cell in row])
        except Exception as e:
            print(f"Error: OCR failed for {image_path}: {e}")
            continue

        # Save the result to a CSV file
        image_name = Path(image_path).stem
        output_file = Path(output_folder) / f"{image_name}_table.csv"
        try:
            df = pd.DataFrame(table)
            df.to_csv(output_file, index=False, header=False, encoding='utf-8')
            print(f"Results saved: {output_file}")
            print("\nDataFrame preview:")
            print(df)
        except Exception as e:
            print(f"Error: could not save {output_file}: {e}")

if __name__ == "__main__":
    main()
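
OCR text often contains commas and quotation marks, which the CSV format handles by quoting. The sketch below uses the standard-library `csv` module rather than pandas, purely to show what that quoting looks like and that the rows survive a round trip. `rows_to_csv_text` and `csv_text_to_rows` are hypothetical helpers, and the sample rows are made up.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Serialize a list of rows to CSV text using the default minimal quoting."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    rows = [["No", "Supplier", "Note"], ["1", "Acme, Ltd", 'said "ok"']]
    text = rows_to_csv_text(rows)
    print(text)
    print(csv_text_to_rows(text) == rows)
```

pandas `to_csv` uses minimal quoting by default as well, so cells with commas or quotes should be written in the same style.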


## Summary

This notebook has the following properties:
- **Modularity**: each step lives in its own function, which makes the code easier to reuse and to follow.
- **Error handling**: image reading, line detection, cell cutting, OCR, and CSV saving each catch exceptions and report them, so one bad image does not stop the batch.
- **Text cleanup**: `\n` and `\r` characters are removed from the OCR output, so every table cell stays on a single line in the CSV file.
- **Caching**: `detect_script` is wrapped in `lru_cache`, so identical text is classified only once per `OCRProcessor` instance.
- **Readability**: every step is accompanied by an explanation.
- **Output**: results are saved as `Contracts/table_csv/{image_name}_table.csv`.

