# AutoVQA - Getting Started Guide

A Python library for automated visual question answering (VQA) data generation and processing.

This notebook covers the following modules:
1. **Collect** - Download and prepare VQA datasets
2. **Preprocess** - Image preprocessing utilities
3. **Augment** - Generate VQA data using LLMs
4. **EDA** - Exploratory Data Analysis
5. **Filter** - Data filtering and cleaning
6. **Balance** - Dataset balancing

<!-- ## Installation

### Install from PyPI
```bash
pip install autovqa
```

### Install from source (for development)
```bash
git clone https://github.com/Ddyln/AutoVQA.git
cd AutoVQA
pip install -e ".[dev]"
``` -->

---
## 1. Collect Module

The collect module provides utilities to download VQA datasets and images.

### Entry Functions
- `download_default_data()` - Downloads the default VQA dataset (text data and images) into the directory given by `output`.

In [None]:
from autovqa.collect import download_default_data

In [None]:
# Download the default dataset to a specified directory
# This will:
# 1. Download text data zip
# 2. Extract the zip file
# 3. Download images from URLs in the extracted JSON
download_default_data(output="./data")
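
Once the download finishes, it can help to confirm what actually landed in the output directory. The following is a minimal sketch using only the standard library. `summarize_downloads` is a hypothetical helper, not part of AutoVQA, and it assumes nothing about the dataset layout beyond files being written somewhere under `./data`.

```python
from collections import Counter
from pathlib import Path


def summarize_downloads(root: str) -> dict:
    """Count files under `root` by lowercase extension (e.g. '.json', '.jpg').

    Hypothetical helper for a quick sanity check; not an AutoVQA function.
    """
    counts = Counter(
        p.suffix.lower() or "<no extension>"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
    return dict(counts)


if __name__ == "__main__":
    # "./data" matches the `output` argument passed to download_default_data.
    for ext, n in sorted(summarize_downloads("./data").items()):
        print(f"{ext}: {n} file(s)")
```

If nothing is printed, no files were found under that directory, which would suggest the download did not complete.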

---
## 2. Preprocess Module

The preprocess module provides image preprocessing utilities for VQA tasks.

### Preprocessing Pipeline
1. Resize - Resize while maintaining aspect ratio
2. Pad - Pad to exact target size
3. Denoise - Reduce image noise
4. Color Correction - Improve contrast using CLAHE
5. Sharpening - Apply unsharp mask
6. Normalize (optional) - Normalize pixel values

In [None]:
from autovqa.preprocess.main import preprocess_image, run_pipeline

print("Preprocess module imported successfully")
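
The Resize and Pad steps of the pipeline above come down to a small geometry calculation: scale the image so it fits inside the target while keeping its aspect ratio, then pad the remainder. The sketch below illustrates that calculation only. `fit_and_pad` is a hypothetical name, and the centered padding and rounding are assumptions; AutoVQA's own `resize_image` and `pad_image` may behave differently.

```python
def fit_and_pad(src_hw, target_hw):
    """Return the resized (height, width) and the (top, bottom, left, right) padding.

    Hypothetical illustration: scales by the largest factor that still fits
    inside the target, then splits the leftover space evenly (assumption).
    """
    src_h, src_w = src_hw
    tgt_h, tgt_w = target_hw

    # The smaller of the two ratios guarantees both sides fit in the target.
    scale = min(tgt_h / src_h, tgt_w / src_w)
    new_h = min(tgt_h, max(1, round(src_h * scale)))
    new_w = min(tgt_w, max(1, round(src_w * scale)))

    # Whatever space is left over is filled with padding.
    pad_h = tgt_h - new_h
    pad_w = tgt_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return (new_h, new_w), (top, pad_h - top, left, pad_w - left)


if __name__ == "__main__":
    # A square 1000x1000 image placed into a (480, 640) target.
    print(fit_and_pad((1000, 1000), (480, 640)))
```

For a square input and a `(480, 640)` target, the height is the limiting side, so the image is scaled to 480x480 and the extra 160 columns are split between the left and right edges.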

### 2.1 Preprocess Single Image

Process a single image through the preprocessing pipeline.

In [None]:
import cv2

# Preprocess a single image
processed_image = preprocess_image(
    image_path="path/to/image.jpg",
    target_size=(480, 640),  # (height, width)
    do_normalize=False
)

# Save the processed image
cv2.imwrite("processed_image.jpg", processed_image)

### 2.2 Batch Preprocess Images

Process all images in a directory.

In [None]:
# Run preprocessing pipeline on all images in a folder
run_pipeline(
    input_folder="./raw_images",
    output_folder="./processed_images",
    do_normalize=False
)

### 2.3 Individual Preprocessing Functions

You can also use individual preprocessing functions.

In [None]:
from autovqa.preprocess.image.resize import resize_image, pad_image
from autovqa.preprocess.image.denoise import denoise_image
from autovqa.preprocess.image.color_correction import color_correction
from autovqa.preprocess.image.sharpening import unsharp_mask
from autovqa.preprocess.image.normalize import normalize_image

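# Example: manual preprocessing pipeline (cv2 is imported in section 2.1)
image = cv2.imread("image.jpg")
if image is None:  # cv2.imread returns None rather than raising when the file cannot be read
    raise FileNotFoundError("Could not read image.jpg")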
image = resize_image(image, target_size=(480, 640))
image = pad_image(image, target_size=(480, 640))
image = denoise_image(image)
image = color_correction(image)
image = unsharp_mask(image)

---
## 3. Augment Module

The augment module generates VQA question-answer pairs using LLMs (e.g., Gemini).

### Configuration
Before using the augment module, you need to set up a config file at:
- Linux: `~/.config/autovqa/config.toml`
- Windows: `C:\Users\<user>\AppData\Local\autovqa\autovqa\config.toml`

An example configuration is located at `./src/autovqa/augment/sample_config.toml` in the repository.
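
Before creating a client, you may want to check that the config file exists where the library expects it. The sketch below builds the two paths listed above. `expected_config_path` is a hypothetical helper, the use of the `LOCALAPPDATA` environment variable is an assumption, and any non-Windows platform is treated like Linux because only these two locations are documented here.

```python
import os
import sys
from pathlib import Path


def expected_config_path() -> Path:
    """Return the documented config location for the current platform.

    Hypothetical helper; AutoVQA may resolve this path differently.
    """
    if sys.platform.startswith("win"):
        # Assumption: LOCALAPPDATA points at C:\Users\<user>\AppData\Local.
        base = os.environ.get("LOCALAPPDATA") or str(Path.home() / "AppData" / "Local")
        return Path(base) / "autovqa" / "autovqa" / "config.toml"
    # Linux path from the list above (also used here for other platforms).
    return Path.home() / ".config" / "autovqa" / "config.toml"


if __name__ == "__main__":
    path = expected_config_path()
    status = "found" if path.is_file() else "missing"
    print(f"Config file {status}: {path}")
```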

In [None]:
from autovqa.augment.client import AugmentClient

### 3.1 Generate VQA Data from Images

Generate question-answer pairs for images using an LLM.

In [None]:
# Initialize the augment client
client = AugmentClient(service_name="gemini")

# Generate VQA data for all images in a folder
results = client.run_pipeline(
    image_folder_dir="./images",
    output_json_path="./output/augmented_vqa.json"
)

print(f"Generated {len(results)} QA pairs")

### 3.2 Generate for Single Image

Generate VQA data for a single image.

In [None]:
# Generate for a single image
client = AugmentClient(service_name="gemini")
response = client.generate_response("path/to/image.jpg")

if response:
    formatted = client.format_response(
        json_response=response.model_dump(),
        image_path="path/to/image.jpg"
    )
    print(formatted)
else:
    print("No response generated.")

---
## 4. EDA Module (Exploratory Data Analysis)

The EDA module performs comprehensive analysis on VQA data:
- Data cleaning and deduplication
- Feature extraction (scene types, main objects)
- Statistical analysis
- Report generation (Excel files)

### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data | list | required | JSON data to analyze |
| output_dir | str | "report" | Directory for reports |
| generate_report | bool | True | Generate Excel reports |
| aggregation_type | str | "median" | Score aggregation (median/mean/max/min) |
| history | list | None | Track operations history |

In [None]:
import json
from autovqa import eda_pipeline
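
The four `aggregation_type` options in the parameter table can give quite different results when scores are skewed. The sketch below shows the difference on a small list of scores using the standard library. `aggregate_scores` is a hypothetical function for illustration and is not how `eda_pipeline` is necessarily implemented.

```python
import statistics

# Maps each documented option name to a standard-library aggregator.
AGGREGATORS = {
    "median": statistics.median,
    "mean": statistics.mean,
    "max": max,
    "min": min,
}


def aggregate_scores(scores, aggregation_type="median"):
    """Collapse several scores into one value (hypothetical illustration)."""
    if aggregation_type not in AGGREGATORS:
        raise ValueError(f"Unknown aggregation_type: {aggregation_type!r}")
    return AGGREGATORS[aggregation_type](scores)


if __name__ == "__main__":
    scores = [2, 3, 10]  # made-up scores with one outlier
    for name in AGGREGATORS:
        print(name, aggregate_scores(scores, name))
```

With one outlier in the list, the median stays at the middle value while the mean is pulled toward the outlier, which is one reason a median default can be the more robust choice.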

### 4.1 Load Data

In [None]:
import os

# Load your JSON data
DATA_PATH = "../data/data_text/Data/combined_dataset/datasetQA_combined.json"

if os.path.exists(DATA_PATH):
    with open(DATA_PATH, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"Loaded {len(data)} records")
    print(f"Sample record keys: {list(data[0].keys())}")
else:
    print(f"Data file not found at {DATA_PATH}")
    print("Please update DATA_PATH to point to your data file")
    data = []

### 4.2 Run EDA Pipeline

In [None]:
# Run EDA Pipeline
if data:
    df = eda_pipeline(
        data=data,
        output_dir="./reports",
        generate_report=True,
        aggregation_type="median"
    )
    
    print("EDA completed")
    print(f"DataFrame shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")

In [None]:
# View the processed DataFrame
if 'df' in dir():
    display(df.head())

In [None]:
# Check DataFrame statistics
if 'df' in dir():
    df.info()

---
## 5. Filter Module

The filter module filters data based on quality labels and thresholds.

### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data | DataFrame | required | DataFrame to filter |
| column_names | list | None | Columns to check (auto-detects "Label" columns) |
| threshold | float | 0.5 | Minimum ratio of passed labels (0.0-1.0) |
| keep_columns | list | None | Columns to keep in result |
| show_stats | bool | True | Show filtering statistics |

In [None]:
from autovqa import filter_pipeline
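
The `threshold` parameter is easiest to understand as a per-record pass ratio. The sketch below assumes each label is a simple pass/fail boolean, which may not match how AutoVQA stores its "Label" columns. `pass_ratio` and `keep_record` are hypothetical names used only for illustration.

```python
def pass_ratio(labels):
    """Fraction of labels that passed (assumes truthy means 'passed')."""
    return sum(bool(x) for x in labels) / len(labels)


def keep_record(labels, threshold=0.5):
    """Keep a record when its pass ratio is at least `threshold`.

    Hypothetical illustration of the rule described in the table above;
    records without any labels are dropped here (assumption).
    """
    if not labels:
        return False
    return pass_ratio(labels) >= threshold


if __name__ == "__main__":
    record = [True, True, False]  # made-up labels: 2 of 3 passed
    for t in (0.3, 0.5, 0.7):
        print(f"threshold={t}: keep={keep_record(record, t)}")
```

A record with two of three labels passing has a ratio of about 0.67, so under this reading it survives thresholds of 0.3 and 0.5 but not 0.7.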

### 5.1 Run Filter Pipeline

In [None]:
# Run Filter Pipeline
if 'df' in dir():
    initial_count = len(df)
    print(f"Initial records: {initial_count}")
    
    df_filtered = filter_pipeline(
        data=df,
        threshold=0.5,  # Keep records where >= 50% of labels passed
        show_stats=True
    )
    
    filtered_count = len(df_filtered)
    print(f"Records after filtering: {filtered_count}")
    print(f"Removed: {initial_count - filtered_count} records")

### 5.2 Filter with Custom Threshold

In [None]:
# Example: Stricter filtering with higher threshold
df_strict = filter_pipeline(
    data=df,
    threshold=0.7,  # Stricter: >= 70% of labels must pass
    show_stats=True
)

# Example: Lenient filtering with lower threshold
df_lenient = filter_pipeline(
    data=df,
    threshold=0.3,  # Lenient: >= 30% of labels must pass
    show_stats=True
)

---
## 6. Balance Module

The balance module ensures balanced class distributions in your dataset.

### Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| df_raw | DataFrame | required | Input DataFrame |
| numeric_columns | list | config | Columns for score computation |
| feature_columns | list | config | Columns to balance |
| reason_depth_weight | int | 4 | Weight for reason depth |
| percent_min_samples | float | 0.01 | Minimum share of samples a label needs to be kept (0.01 = 1%) |
| top_percent | float | 0.9 | Keep the top share of labels by frequency (0.9 = 90%) |
| limit_percent | float | 10 | Max percent difference between classes |
| keep_outliers | bool | True | Keep rare labels |
| output_path | str | None | CSV save path |

In [None]:
from autovqa import balancer_pipeline
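
To give a feel for what `percent_min_samples` controls, the sketch below drops labels whose share of the data falls under a minimum fraction. It is an illustration under assumptions, not the `balancer_pipeline` implementation: `frequent_labels` is a hypothetical name, the label values are made up, and the parameter is read as a fraction (0.02 meaning 2%).

```python
from collections import Counter


def frequent_labels(labels, percent_min_samples=0.01):
    """Return the set of labels whose share of samples meets the minimum.

    Hypothetical illustration; the real balancer may apply further rules
    (for example `top_percent`, `limit_percent`, or `keep_outliers`).
    """
    total = len(labels)
    if total == 0:
        return set()
    counts = Counter(labels)
    return {label for label, n in counts.items() if n / total >= percent_min_samples}


if __name__ == "__main__":
    # Made-up scene labels: 60 indoor, 39 outdoor, 1 underwater.
    labels = ["indoor"] * 60 + ["outdoor"] * 39 + ["underwater"]
    print(sorted(frequent_labels(labels, percent_min_samples=0.02)))
```

In this made-up example the single "underwater" sample makes up 1% of the data, so it is kept at the 0.01 setting and dropped at 0.02.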

### 6.1 Run Balancer Pipeline

In [None]:
# Run Balancer Pipeline
if 'df_filtered' in dir():
    pre_balance_count = len(df_filtered)
    print(f"Records before balancing: {pre_balance_count}")
    
    df_balanced = balancer_pipeline(
        df_raw=df_filtered,
        output_path=None  # Set a path to save as CSV
    )
    
    balanced_count = len(df_balanced)
    print(f"Records after balancing: {balanced_count}")
    print(f"Removed: {pre_balance_count - balanced_count} records")
else:
    print("Filtered DataFrame not found. Please run the filter pipeline first.")

### 6.2 Balance with Custom Parameters

In [None]:
# Example: Custom balancing parameters
df_balanced = balancer_pipeline(
    df_raw=df_filtered,
    percent_min_samples=0.02,  # Keep labels with >= 2% samples
    top_percent=0.85,  # Keep top 85% labels by frequency
    limit_percent=15,  # Max 15% difference between classes
    keep_outliers=False,  # Remove rare labels
    output_path="./output/balanced_data.csv"
)

---
## Complete Pipeline Example

The function below chains the EDA, filter, and balancer pipelines into a single end-to-end workflow, starting from a JSON data file.

In [None]:
def process_vqa_data(
    data_path: str,
    output_dir: str = "./output",
    filter_threshold: float = 0.5,
    generate_reports: bool = True
):
    """
    Complete VQA data processing pipeline.
    
    Args:
        data_path: Path to JSON data file
        output_dir: Directory for outputs
        filter_threshold: Threshold for filtering (0.0-1.0)
        generate_reports: Whether to generate EDA reports
    
    Returns:
        pd.DataFrame: Processed and balanced DataFrame
    """
    import json
    import os
    from autovqa import eda_pipeline, filter_pipeline, balancer_pipeline
    
    os.makedirs(output_dir, exist_ok=True)
    
    # Step 1: Load data
    print("Step 1: Loading data...")
    with open(data_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(f"  Loaded {len(data)} records")
    
    # Step 2: EDA Pipeline
    print("Step 2: Running EDA pipeline...")
    df = eda_pipeline(
        data=data,
        output_dir=os.path.join(output_dir, "reports"),
        generate_report=generate_reports
    )
    print(f"  EDA complete: {len(df)} records")
    
    # Step 3: Filter Pipeline
    print("Step 3: Running filter pipeline...")
    df = filter_pipeline(
        data=df,
        threshold=filter_threshold,
        show_stats=False
    )
    print(f"  Filtering complete: {len(df)} records")
    
    # Step 4: Balancer Pipeline
    print("Step 4: Running balancer pipeline...")
    df = balancer_pipeline(
        df_raw=df,
        output_path=os.path.join(output_dir, "balanced_data.csv")
    )
    print(f"  Balancing complete: {len(df)} records")
    
    print("Pipeline complete!")
    return df

In [None]:
# Run the complete pipeline
df_final = process_vqa_data(
    data_path="path/to/your/data.json",
    output_dir="./output",
    filter_threshold=0.5,
    generate_reports=True
)

---
## Saving Results

Save your processed data in various formats.

In [None]:
# Save as CSV
df_balanced.to_csv("output.csv", index=False)

# Save as JSON
df_balanced.to_json("output.json", orient="records", force_ascii=False, indent=2)

# Save as Parquet (pandas needs the optional pyarrow or fastparquet package for this)
df_balanced.to_parquet("output.parquet", index=False)