# Detecting and visualizing 7-column tables in PDF files

This notebook detects 7-column tables in the PDF files under `Contracts/xariduz/`, saves each one as `Contracts/table_imgs/{pdf_name}_table_{table_number}.png`, and displays it. The code is split into one function per step, and the file-listing and table-detection functions catch exceptions and return an empty list instead of stopping the run.

## Goal
- Read the PDF files in the `Contracts/xariduz/` folder.
- Detect the tables in each PDF file and keep only those with 7 columns.
- Display each detected table as an image and save it as `Contracts/table_imgs/{pdf_name}_table_{table_number}.png`.

## Requirements
- **Python 3.9+** environment.
- Required packages:
  - `PyMuPDF` (imported as `fitz`): `pip install PyMuPDF`
  - `opencv-python`: `pip install opencv-python`
  - `numpy`: `pip install numpy`
  - `Pillow`: `pip install Pillow`
  - `matplotlib`: `pip install matplotlib`
- **Folder layout**: the PDF files must be in the `Contracts/xariduz/` folder. Table images are saved to the `Contracts/table_imgs/` folder.

## Overall workflow
1. Install and import the required packages.
2. List the PDF files in the given folder.
3. Detect the tables in each PDF file and keep the 7-column ones.
4. Visualize the detected tables and save them as `Contracts/table_imgs/{pdf_name}_table_{table_number}.png`.


## Step 1: Install and import the required packages

This cell checks each required Python package and installs it with `pip` if it cannot be imported. All required modules are then imported.


In [None]:
import subprocess
import sys

# Required packages (pip name) and their import names
required_packages = {
    'PyMuPDF': 'fitz',
    'opencv-python': 'cv2',
    'numpy': 'numpy',
    'Pillow': 'PIL',
    'matplotlib': 'matplotlib'
}

# Check each package and install it if the import fails
for package, import_name in required_packages.items():
    try:
        __import__(import_name)
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Import all modules
import os
import glob
import fitz  # PyMuPDF
import cv2
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
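
The check above imports each module just to see whether it exists. As a minimal sketch of an alternative, the standard library's `importlib.util.find_spec` can report missing packages without importing or installing anything. The helper name `missing_packages` and the example mapping are illustrative assumptions, not part of the notebook.

```python
import importlib.util


def missing_packages(required):
    """Return the pip names whose import name cannot be found.

    `required` maps a pip package name to its top-level import name.
    """
    return [
        package
        for package, import_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


# Illustrative mapping: `json` ships with Python, the second name is made up.
example = {"json (stdlib)": "json", "made-up-package": "made_up_module_xyz"}
print(missing_packages(example))
```

Passing the notebook's `required_packages` dictionary to this helper would list what still needs to be installed, leaving the decision to install to the user.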


## Step 2: List the PDF files

Finds all PDF files in the given folder and returns them as a list. If no PDF files are found (including when the folder does not exist), a message is printed and an empty list is returned.


In [None]:
def list_pdf_files(folder_path):
    """Return all PDF files found in the given folder."""
    try:
        # Sort so that files are processed in a stable order
        pdf_files = sorted(glob.glob(os.path.join(folder_path, "*.pdf")))
        if not pdf_files:
            print(f"No PDF files found in {folder_path}.")
        return pdf_files
    except Exception as e:
        print(f"Error: could not read the folder: {e}")
        return []
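
On POSIX systems the `*.pdf` pattern is matched case-sensitively, so a file named `SCAN.PDF` would be skipped by the function above. A minimal sketch of a case-insensitive variant using `pathlib` follows; the function name `list_pdf_files_any_case` and the file names in the usage example are hypothetical.

```python
from pathlib import Path
import tempfile


def list_pdf_files_any_case(folder_path):
    """Return sorted paths of files whose extension is .pdf in any letter case."""
    folder = Path(folder_path)
    if not folder.is_dir():
        return []
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Usage on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "B.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    print([Path(p).name for p in list_pdf_files_any_case(tmp)])
```

The explicit `is_dir()` check also makes the missing-folder case visible in the code rather than relying on an empty match.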


## Step 3: Detect tables in the PDF files

This function reads a PDF file page by page, renders each page to an image, detects the tables in it, and keeps only those with 7 columns. Tables are located with OpenCV by extracting the horizontal and vertical ruling lines from the page image.
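
The detection code that follows renders pages with `fitz.Matrix(300/72, 300/72)`, which corresponds to 300 DPI, and uses pixel constants (50-pixel line kernels, `min_area = 10000`, `min_distance = 20`) alongside that resolution. A minimal sketch of how such constants could be rescaled if a different DPI were chosen is shown here. The helper names, the 300 DPI reference, and the assumption that lengths scale linearly with DPI are illustrative, not part of the notebook.

```python
PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch
REFERENCE_DPI = 300       # assumed resolution for the notebook's pixel constants


def zoom_for_dpi(dpi):
    """Zoom factor to pass as both arguments of fitz.Matrix for a target DPI."""
    return dpi / PDF_POINTS_PER_INCH


def scale_length(pixels_at_reference, dpi):
    """Rescale a pixel length (kernel size, min_distance) linearly with DPI."""
    return max(1, round(pixels_at_reference * dpi / REFERENCE_DPI))


def scale_area(area_at_reference, dpi):
    """Rescale a pixel area (min_area) with the square of the DPI ratio."""
    return max(1, round(area_at_reference * (dpi / REFERENCE_DPI) ** 2))


print(zoom_for_dpi(300))       # about 4.17
print(scale_length(50, 150))   # 25
print(scale_area(10000, 150))  # 2500
```

Rendering at a lower DPI is faster but makes thin ruling lines harder to detect, so any rescaled values would still need checking against real pages.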


In [None]:
def extract_tables_from_pdf(pdf_path):
    """Detect the 7-column tables in a PDF file and return them as a list."""
    try:
        doc = fitz.open(pdf_path)
        # splitext removes only the final extension, whatever its letter case
        file_basename = os.path.splitext(os.path.basename(pdf_path))[0]
        tables_found = 0
        seven_column_tables = []

        print(f"\nAnalyzing file: {file_basename}")

        for page_num, page in enumerate(doc):
            print(f"Checking page {page_num+1}...")

            # Render the page to an image at 300 DPI (PDF points are 1/72 inch)
            pix = page.get_pixmap(matrix=fitz.Matrix(300/72, 300/72))
            img_data = pix.tobytes("png")
            nparr = np.frombuffer(img_data, np.uint8)
            img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
            img_orig = img.copy()

            # Convert the image to grayscale, then to an inverted binary image
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                          cv2.THRESH_BINARY_INV, 11, 2)

            # Extract horizontal and vertical lines with morphological opening
            horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
            vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
            horizontal_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=1)
            vertical_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=1)
            table_mask = cv2.add(horizontal_lines, vertical_lines)

            # Find the table contours and drop regions that are too small
            contours, _ = cv2.findContours(table_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            min_area = 10000
            table_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_area]

            for i, contour in enumerate(table_contours):
                x, y, w, h = cv2.boundingRect(contour)
                padding = 10
                x = max(0, x - padding)
                y = max(0, y - padding)
                w = min(img_orig.shape[1] - x, w + 2*padding)
                h = min(img_orig.shape[0] - y, h + 2*padding)
                table_img = img_orig[y:y+h, x:x+w]

                # Detect column separators: vertical lines at least a third of the crop height
                table_gray = cv2.cvtColor(table_img, cv2.COLOR_BGR2GRAY)
                vertical_kernel_detailed = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(1, h // 3)))
                vertical_lines_detailed = cv2.morphologyEx(cv2.threshold(table_gray, 0, 255,
                    cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1], cv2.MORPH_OPEN, vertical_kernel_detailed)
                v_projection = np.sum(vertical_lines_detailed, axis=0)

                threshold = np.max(v_projection) * 0.3
                column_edges = np.where(v_projection > threshold)[0]
                min_distance = 20
                column_boundaries = [column_edges[0]] if len(column_edges) > 0 else []
                for edge in column_edges[1:]:
                    if edge - column_boundaries[-1] > min_distance:
                        column_boundaries.append(edge)
                num_columns = len(column_boundaries) - 1 if len(column_boundaries) > 1 else 1

                # Keep only the 7-column tables
                if num_columns == 7:
                    seven_column_tables.append({
                        'page': page_num + 1,
                        'table': i + 1,
                        'columns': num_columns,
                        'image': table_img,
                        'pdf_name': file_basename
                    })

                tables_found += 1
                print(f"  → Table found: page {page_num+1}, table {i+1}, columns: {num_columns}")

        # Release the file handle once all pages have been processed
        doc.close()

        if tables_found == 0:
            print("No tables found in the file.")
        else:
            print(f"Found {tables_found} table(s) in total, {len(seven_column_tables)} of them with 7 columns.")

        return seven_column_tables

    except Exception as e:
        print(f"Error: table detection failed: {e}")
        return []
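
The column-counting step inside the function (threshold the vertical projection, merge nearby indices into one boundary, count the gaps) can be isolated and tried on synthetic data. The following is a pure-Python sketch of that same logic; the helper name `count_columns` and the synthetic projection are assumptions made for illustration.

```python
def count_columns(projection, rel_threshold=0.3, min_distance=20):
    """Count columns from a 1-D vertical-line projection (hypothetical helper)."""
    if not projection:
        return 1
    peak = max(projection)
    if peak <= 0:
        return 1  # no vertical lines at all: treat the region as one column
    threshold = peak * rel_threshold
    edges = [x for x, value in enumerate(projection) if value > threshold]
    boundaries = []
    for edge in edges:
        # Start a new boundary only if this index is far from the last kept one
        if not boundaries or edge - boundaries[-1] > min_distance:
            boundaries.append(edge)
    return len(boundaries) - 1 if len(boundaries) > 1 else 1


# Synthetic projection: 8 vertical rules, each 3 px wide, spaced 50 px apart.
projection = [0] * 400
for start in range(10, 400, 50):
    for x in range(start, start + 3):
        projection[x] = 255 * 100
print(count_columns(projection))  # 7
```

Because each index is compared with the last kept boundary, a ruling line wider than `min_distance` pixels would be counted as two boundaries, and a table without an outer border on one side would be reported with one column fewer.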


## Step 4: Visualize the tables

Displays each detected 7-column table as an image and saves it as `Contracts/table_imgs/{pdf_name}_table_{table_number}.png`.


In [None]:
from pathlib import Path

def visualize_tables(tables_found):
    """Display the detected 7-column tables and save each one as a PNG file."""
    print(f"Total number of 7-column tables found: {len(tables_found)}")

    if not tables_found:
        print("No tables to visualize.")
        return

    # Create the output folder
    save_path = Path("Contracts/table_imgs/")
    save_path.mkdir(parents=True, exist_ok=True)

    for idx, table_info in enumerate(tables_found):
        plt.figure(figsize=(15, 5))
        plt.imshow(cv2.cvtColor(table_info['image'], cv2.COLOR_BGR2RGB))
        plt.title(f"PDF: {table_info['pdf_name']}, Page: {table_info['page']}, Table: {table_info['table']}, Columns: {table_info['columns']}")
        plt.axis('off')
        plt.show()

        # Save the table image. 'table' restarts at 1 on every page, so the running
        # index within this PDF is used to keep file names unique across pages.
        output_file = save_path / f"{table_info['pdf_name']}_table_{idx + 1}.png"
        cv2.imwrite(str(output_file), table_info['image'])
        print(f"Table saved: {output_file}")
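
The `table` value stored by `extract_tables_from_pdf` restarts at 1 on every page, so two tables on different pages can carry the same number. If the saved images need to be traceable to their source page, one option is to encode the page number in the file name. The `_p{page}_` naming scheme and the `table_output_path` helper below are assumptions for illustration and differ from the notebook's stated file-name pattern.

```python
from pathlib import Path


def table_output_path(out_dir, pdf_name, page, table):
    """Build an output path that encodes both the page and the per-page table number."""
    return Path(out_dir) / f"{pdf_name}_p{page}_table_{table}.png"


# Made-up values: table 2 on page 3 of a file called "contract".
path = table_output_path("Contracts/table_imgs", "contract", page=3, table=2)
print(path.name)  # contract_p3_table_2.png
```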


## Step 5: Main workflow

This cell combines the functions above: it lists the PDF files, detects the tables in each one, and visualizes the results.


In [None]:
def main():
    """Main workflow: list the PDF files, detect tables, and visualize them."""
    folder_path = "Contracts/xariduz/"
    pdf_files = list_pdf_files(folder_path)

    if not pdf_files:
        print("No PDF files found. Stopping.")
        return

    for pdf_file in pdf_files:
        print(f"\nStarted processing file: {pdf_file}")
        tables_found = extract_tables_from_pdf(pdf_file)
        visualize_tables(tables_found)

if __name__ == "__main__":
    main()


## Summary

This notebook has the following properties:
- **Modularity**: each step lives in its own function, which makes the code easier to reuse and understand.
- **Error handling**: `list_pdf_files` and `extract_tables_from_pdf` catch exceptions, print a message, and return an empty list, so one unreadable PDF does not stop the run. `visualize_tables` has no exception handling of its own.
- **Image processing**: pages are rendered with PyMuPDF and analysed with OpenCV morphological operations.
- **Readability**: each step is preceded by a short explanation.
- **Saving**: 7-column tables are saved as `Contracts/table_imgs/{pdf_name}_table_{table_number}.png`.

