# Kannada OCR using Tesseract and Pillow

This notebook demonstrates how to extract Kannada text from images using pytesseract and Pillow.

## Installation Commands

Run these commands in your terminal/command prompt before running this notebook:

### 1. Install Python packages
```bash
pip install pytesseract pillow jupyter
```

### 2. Install Tesseract OCR Engine

**For Windows:**
- Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
- Add Tesseract to your PATH or specify the path in code
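
A quick way to confirm that the Tesseract executable is reachable from Python is to look it up on PATH with the standard library. This is a minimal sketch; the helper name `find_tesseract` is illustrative and not part of pytesseract.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary if it is on PATH, else None."""
    return shutil.which(executable)


path = find_tesseract()
if path:
    print(f"Tesseract found at: {path}")
else:
    print("Tesseract is not on PATH; install it or set "
          "pytesseract.pytesseract.tesseract_cmd explicitly.")
```

If the lookup returns `None` on Windows, setting `pytesseract.pytesseract.tesseract_cmd` to the installed `tesseract.exe` is the usual fallback.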

**For macOS:**
```bash
brew install tesseract
```

**For Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install tesseract-ocr
```

### 3. Install Kannada Language Pack

**For Windows:**
- Download Kannada language data from: https://github.com/tesseract-ocr/tessdata
- Place `kan.traineddata` in Tesseract's tessdata folder
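
After copying the file, it can help to verify that `kan.traineddata` really is where Tesseract will look. The sketch below only checks for the file on disk. The helper name `has_traineddata` is hypothetical, and it assumes that the `TESSDATA_PREFIX` environment variable, if set, points directly at the tessdata folder; otherwise pass your folder explicitly.

```python
import os
from pathlib import Path


def has_traineddata(tessdata_dir, lang="kan"):
    """Return True if <lang>.traineddata exists in the given tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


# Assumption: TESSDATA_PREFIX, when set, is the tessdata folder itself.
tessdata_dir = os.environ.get("TESSDATA_PREFIX")
if tessdata_dir:
    print("kan.traineddata present:", has_traineddata(tessdata_dir))
else:
    print("TESSDATA_PREFIX is not set; pass your tessdata folder explicitly.")
```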

**For macOS:**
```bash
# tesseract-lang installs the additional language data, including Kannada
brew install tesseract-lang
```

**For Ubuntu/Debian:**
```bash
sudo apt install tesseract-ocr-kan
```

## Import Required Libraries

In [17]:
import pytesseract
from PIL import Image
import os

# For Windows users: Uncomment and modify the path below if Tesseract is not in PATH
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

print("Libraries imported successfully!")

Libraries imported successfully!


## Check Available Languages

In [18]:
# Check if Kannada language is available
available_languages = pytesseract.get_languages()
print("Available languages:")
for lang in available_languages:
    print(f"  - {lang}")

if 'kan' in available_languages:
    print("\n✅ Kannada language pack is installed!")
else:
    print("\n❌ Kannada language pack is not installed. Please install it first.")

image_path = "./image.png"

Available languages:
  - afr
  - amh
  - ara
  - asm
  - aze
  - aze_cyrl
  - bel
  - ben
  - bod
  - bos
  - bre
  - bul
  - cat
  - ceb
  - ces
  - chi_sim
  - chi_sim_vert
  - chi_tra
  - chi_tra_vert
  - chr
  - cos
  - cym
  - dan
  - deu
  - deu_latf
  - div
  - dzo
  - ell
  - eng
  - enm
  - epo
  - equ
  - est
  - eus
  - fao
  - fas
  - fil
  - fin
  - fra
  - frm
  - fry
  - gla
  - gle
  - glg
  - grc
  - guj
  - hat
  - heb
  - hin
  - hrv
  - hun
  - hye
  - iku
  - ind
  - isl
  - ita
  - ita_old
  - jav
  - jpn
  - jpn_vert
  - kan
  - kat
  - kat_old
  - kaz
  - khm
  - kir
  - kmr
  - kor
  - lao
  - lat
  - lav
  - lit
  - ltz
  - mal
  - mar
  - mkd
  - mlt
  - mon
  - mri
  - msa
  - mya
  - nep
  - nld
  - nor
  - oci
  - ori
  - osd
  - pan
  - pol
  - por
  - pus
  - que
  - ron
  - rus
  - san
  - sin
  - slk
  - slv
  - snd
  - spa
  - spa_old
  - sqi
  - srp
  - srp_latn
  - sun
  - swa
  - swe
  - syr
  - tam
  - tat
  - tel
  - tgk
  - tha
  - tir
  - ton
 

## Function to Read Kannada Text from Image

In [19]:
def read_kannada_text(image_path):
    """
    Read Kannada text from an image file
    
    Args:
        image_path (str): Path to the image file
    
    Returns:
        str: Extracted Kannada text
    """
    try:
        # Open the image using Pillow
        image = Image.open(image_path)
        
        # Display image info
        print(f"Image size: {image.size}")
        print(f"Image mode: {image.mode}")
        print(f"Image format: {image.format}")
        
        # Extract text using pytesseract with the Kannada language model only
        # (pass lang='kan+eng' instead if the image also contains English text)
        extracted_text = pytesseract.image_to_string(image, lang='kan')
        
        return extracted_text
    
    except FileNotFoundError:
        return f"Error: Image file '{image_path}' not found."
    except Exception as e:
        return f"Error processing image: {str(e)}"
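
Tesseract accepts several language codes joined with `+` in the `lang` argument, which may help when a page mixes Kannada with English words. The following is a minimal sketch, not part of the original notebook: `build_lang` and `read_mixed_text` are illustrative names, and the approach assumes the `eng` language data is installed alongside `kan`.

```python
def build_lang(*codes):
    """Join Tesseract language codes into the 'a+b' form used by the lang argument."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def read_mixed_text(image_path, codes=("kan", "eng")):
    """OCR an image with several languages (needs pytesseract, Pillow, and the language packs)."""
    # Imported here so build_lang can be used without the OCR stack installed
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=build_lang(*codes))
```

Calling `read_mixed_text('ima2.png')` would then run OCR with `lang='kan+eng'`.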

## Enhanced OCR Function with Preprocessing

In [20]:
def read_kannada_text_enhanced(image_path, preprocess=True):
    """
    Read Kannada text from an image with optional preprocessing
    
    Args:
        image_path (str): Path to the image file
        preprocess (bool): Whether to apply image preprocessing
    
    Returns:
        str: Extracted Kannada text
    """
    try:
        # Open the image
        image = Image.open(image_path)
        
        if preprocess:
            # Convert to grayscale for better OCR results
            image = image.convert('L')
            
            # Enhance contrast (optional)
            from PIL import ImageEnhance
            enhancer = ImageEnhance.Contrast(image)
            image = enhancer.enhance(2.0)  # Increase contrast
        
        # OCR configuration: --oem 3 selects the default engine mode,
        # --psm 6 treats the image as a single uniform block of text
        custom_config = r'--oem 3 --psm 6'
        
        # Extract text
        extracted_text = pytesseract.image_to_string(
            image, 
            lang='kan',
            config=custom_config
        )
        
        return extracted_text
    
    except Exception as e:
        return f"Error processing image: {str(e)}"
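
The example image in this notebook is reported as `RGBA`. Converting such an image straight to `'L'` discards the alpha channel, so transparent areas take on whatever color is stored underneath them (often black), which can hurt recognition. One possible remedy, sketched here with the illustrative helper `flatten_to_grayscale`, is to composite the image onto a white background before the grayscale conversion.

```python
from PIL import Image


def flatten_to_grayscale(image, background=(255, 255, 255)):
    """Composite any transparency onto a solid background, then convert to grayscale."""
    if image.mode in ("RGBA", "LA") or "transparency" in image.info:
        rgba = image.convert("RGBA")
        canvas = Image.new("RGBA", rgba.size, background + (255,))
        image = Image.alpha_composite(canvas, rgba)
    return image.convert("L")


# Demo: a fully transparent pixel with black stored underneath
demo = Image.new("RGBA", (2, 2), (0, 0, 0, 0))
print(demo.convert("L").getpixel((0, 0)))           # naive conversion
print(flatten_to_grayscale(demo).getpixel((0, 0)))  # composited onto white
```

Inside `read_kannada_text_enhanced`, this helper could stand in for the plain `image.convert('L')` call.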

## Example Usage

Place your Kannada image file in the same directory as this notebook and update the filename below.

In [23]:
# Example usage: replace 'ima2.png' with your own image filename
image_filename = 'ima2.png'

# Check if image file exists
if os.path.exists(image_filename):
    print(f"Processing image: {image_filename}")
    print("=" * 50)
    
    # Basic OCR
    print("Basic OCR Result:")
    basic_result = read_kannada_text(image_filename)
    print(basic_result)
    print("\n" + "=" * 50)
    
    # Enhanced OCR with preprocessing
    print("Enhanced OCR Result:")
    enhanced_result = read_kannada_text_enhanced(image_filename, preprocess=True)
    print(enhanced_result)
    
else:
    print(f"Image file '{image_filename}' not found.")
    print("Please place your Kannada image file in the same directory as this notebook.")
    print("Supported formats: JPG, PNG, TIFF, BMP, GIF")

Processing image: ima2.png
Basic OCR Result:
Image size: (1237, 466)
Image mode: RGBA
Image format: PNG
38. ಕ್ರಾ ತಿ ಅ ಗ ನ ಸತ್ತಿ

ಶೌ ಇಜಿಪ್ತದಿಂದ ಇರಾಣದಗಡಿ ವರೆಗೆ ಹಬ್ಬಿರುವ
ಈ ಮುಸ್ಲಿಂ ದೇಶಗಳ ಒಟ್ಟು ವಿಸ್ತಾರ, ರಶಿಯಾ
ಬಿಟ್ಟು ಯುರೋಪಖಂಡದಷ್ಟು .: ಇದೆ.
ಕೆ ಅದರ ಜನಸಂಖ್ಯೆ ಆ ಮಾನದಿಂದ ಬಹಳ
, ಆದುದರಿಂದ ಅಲ್ಲಿಯ ನಿವಾಸಿಗಳ ಜೀವ
ನದ ಮಟ್ಟ ಬಹಳ ಏರಿಕೆಯದಿರಬೇಕೆಂದು ಭಾವಿ
ಹಲು. ಎಡೆಯುಂಟು, ವಸ್ತುಸ್ಥಿತಿ. 'ಇದಕ್ಕೆ
ಬದ್ಧನಿರುದ್ಧವಾಗಿದೆ. ಈ ಜನರ ಜೀವನ-- ಪಾಲಿ.
ಸ್ತಾನ, ಸೀರಿಯ್ಯಾ, ಲೆಬನಾನ್‌ ಹೊರತು ಪಡಿಸಿ
ದರೆ ತಲಾ ದಿನಕ್ಕೆ ಒಂದು ಪೆನ್ಸಿನ. ಅಥವಾ
'ಒಂದಾಣೆಯ ವೆಚ್ಚದಿಂದ ನಡೆಯುತ್ತದೆ. ಇದೊಂ
ಡೆ ಅಲ್ಲಿಯ ಜನರ ಆರ್ಥಿಕ, ಸಾಂಸ್ಕೃತಿಕ ಮಟ್ಟ
ವನ್ನು ಗುರುತಿಸಲು ಸಾಕು.

ಮಧ್ಯ ಪೂರ್ನದ ಪ್ರದೇಶಗಳು «ಒಂದು
ಸಾಮಾಜಿಕ ಮತ್ತು ಅರ್ಥಿಕ ಹೊಲಸುಗೇರಿ?
ಎಂದು ಬ್ರಿಟಿಶ ವಿದೇಶ ಧೋರಣೆಯ ಮೇಲೆ
ಪ್ರಬಲ ಪ್ರಭಾನ ಬೀರುವ ಲಂಡನದ «ಚ್ಯಾಟ್‌
ಹ್ಯಾಂ ಹೌಸ'' ಸಂಘದ ಕಾಯರೊ ಗುಂಪಿನವ
"ಎಗಡಿಸಿದ ಗೌಪ್ಪ ವರದಿಯರಧಿ ಹೇಳಲಾದು.

. ಸೈತ ಉ ಟ್ರ ಶತ
ಅವರೇ " ಮಜ್ಜಿಸ್‌, ಈ ತುದಿಯ. ಇರಾಣದ

[ಮಟ್ಟಿಗೆ ಇದೆಷ್ಟು ಸತ್ಯವೊ ಆ ತುದಿಯ ಉತ್ತರ

ಆಫಿ್ರಕೆಯ ಇಜಿಪ್ತದ ವಿಷಯಕ್ಕೂ ಇವೆರಡರ
ನಡುಎಣ ನಾಡುಗಳ ವಿಷಯಕ್ಕೂ ಅಷ್ಟೇ ಸತ್ಯ.
ಇಜಿಪ್ತದಲ್ಲಿ ಶ್ರೀಮಂತ ««ಪಾಶಾಗಿರೇ.. ಸರಕಾರ,
ಬಡ ««ಫೆಲಾಹೀನ* (ರೈತ) ರಿಗೆ ದನಿಯೆತ್ತುವ
ಅವಕಾಶವಿಲ

## Batch Processing Multiple Images

In [22]:
def process_multiple_images(image_folder='.'):
    """
    Process all image files in a folder
    
    Args:
        image_folder (str): Path to folder containing images
    """
    supported_formats = ('.jpg', '.jpeg', '.png', '.tiff', '.bmp', '.gif')
    
    image_files = [f for f in os.listdir(image_folder) 
                   if f.lower().endswith(supported_formats)]
    
    if not image_files:
        print("No image files found in the specified folder.")
        return
    
    print(f"Found {len(image_files)} image file(s):")
    
    for image_file in image_files:
        print(f"\n{'='*60}")
        print(f"Processing: {image_file}")
        print(f"{'='*60}")
        
        image_path = os.path.join(image_folder, image_file)
        result = read_kannada_text_enhanced(image_path)
        
        print("Extracted Text:")
        print(result)

# Uncomment the line below to process all images in the current directory
# process_multiple_images()
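
Printing is fine for a quick look, but for larger batches it is usually more convenient to save each result to a text file. Writing with an explicit UTF-8 encoding matters here, because the default encoding on some platforms may not be able to represent Kannada characters. The sketch below is one possible approach: `save_ocr_results` is a hypothetical helper, and the OCR function is passed in as a parameter so that any function taking an image path (such as `read_kannada_text_enhanced`) can be used.

```python
from pathlib import Path

SUPPORTED_FORMATS = {".jpg", ".jpeg", ".png", ".tiff", ".bmp", ".gif"}


def save_ocr_results(image_folder, output_folder, ocr_func):
    """Run ocr_func on every image in image_folder and write one UTF-8 .txt per image.

    Returns the list of text files written.
    """
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(image_folder).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in SUPPORTED_FORMATS:
            continue
        text = ocr_func(str(image_path))
        # Keep the original extension in the name so a.png and a.jpg do not collide
        target = out_dir / f"{image_path.name}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Example call (assumes read_kannada_text_enhanced from this notebook is defined):
# save_ocr_results(".", "ocr_output", read_kannada_text_enhanced)
```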

## Tips for Better OCR Results

1. **Image Quality**: Use high-resolution, clear images
2. **Contrast**: Ensure good contrast between text and background
3. **Orientation**: Make sure text is properly oriented (not rotated)
4. **Noise**: Remove noise and artifacts from the image
5. **Font Size**: Larger fonts generally work better
6. **Language**: Specify the correct language (`lang='kan'` for Kannada) for better accuracy
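
Several of these tips (image quality, contrast, font size) can be approximated in Pillow by enlarging a small image, stretching its contrast, and applying a global threshold. The sketch below is illustrative only: `upscale_and_binarize` is a hypothetical helper, and the scale factor of 2 and threshold of 150 are arbitrary starting points rather than tuned values.

```python
from PIL import Image, ImageOps


def upscale_and_binarize(image, scale=2, threshold=150):
    """Enlarge an image, stretch its contrast, and reduce it to black and white."""
    gray = image.convert("L")
    new_size = (gray.width * scale, gray.height * scale)
    gray = gray.resize(new_size, Image.BICUBIC)
    gray = ImageOps.autocontrast(gray)
    # Pixels brighter than the threshold become white, the rest black
    return gray.point(lambda p: 255 if p > threshold else 0)
```

The returned image can be passed to `pytesseract.image_to_string` in the same way as the preprocessed image in `read_kannada_text_enhanced`. Whether it improves accuracy depends on the scan, so compare the output with and without it.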

## Troubleshooting

- If you get `TesseractNotFoundError`, make sure Tesseract is installed and on your PATH
- If Kannada text is not recognized, ensure the Kannada language pack (`kan.traineddata`) is installed
- For Windows users, you may need to set `pytesseract.pytesseract.tesseract_cmd` to the Tesseract executable path manually
- Try different PSM (Page Segmentation Mode) values, set via `--psm` in the config string, if results are poor
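
To compare page segmentation modes, one option is to run the same image through several `--psm` values and inspect the results side by side. The sketch below is a minimal version under stated assumptions: `build_config` and `try_psm_modes` are hypothetical helpers, the OCR function is passed in (for example `pytesseract.image_to_string`), and the candidate values 3 (automatic page segmentation), 4 (single column), 6 (single uniform block), and 11 (sparse text) are just a reasonable starting set.

```python
def build_config(psm, oem=3):
    """Build the Tesseract config string for a page segmentation mode."""
    return f"--oem {oem} --psm {psm}"


def try_psm_modes(image, ocr_func, psm_values=(3, 4, 6, 11), lang="kan"):
    """Run OCR once per PSM value and return a {psm: text} dictionary."""
    results = {}
    for psm in psm_values:
        results[psm] = ocr_func(image, lang=lang, config=build_config(psm))
    return results


# Example call (assumes pytesseract is imported and `image` is an opened Pillow image):
# results = try_psm_modes(image, pytesseract.image_to_string)
# for psm, text in results.items():
#     print(f"--- PSM {psm} ---\n{text}")
```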