### Requirements for 3D RGB Image Convolution

#### A) CUDA Implementation

Implement a CUDA program for 3D convolution on RGB images with three kernel variations:

1. **Kernel 1:** Basic implementation (no tiling)
2. **Kernel 2:** Tiling with block size matching input tile size
3. **Kernel 3:** Tiling with block size matching output tile size
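
The difference between the two tiled variants comes down to tile arithmetic. The sketch below is an illustration rather than part of the assignment: the function names are hypothetical, and it assumes square thread blocks and square masks. It shows how block size, mask size, and stride relate the input and output tile sizes in each scheme.

```python
import math


def input_tiling_geometry(block_dim, mask_dim, stride=1):
    """Kernel 2 style: the block matches the INPUT tile.

    Every thread loads one input element, but only a subset of the
    threads computes an output element.
    """
    if block_dim < mask_dim:
        raise ValueError("block must be at least as large as the mask")
    out_tile = (block_dim - mask_dim) // stride + 1
    active_fraction = (out_tile * out_tile) / (block_dim * block_dim)
    return {
        "input_tile": block_dim,
        "output_tile": out_tile,
        "active_fraction": active_fraction,
    }


def output_tiling_geometry(block_dim, mask_dim, stride=1):
    """Kernel 3 style: the block matches the OUTPUT tile.

    Every thread computes one output element, and the threads cooperate
    to load an input tile that is larger than the block.
    """
    in_tile = (block_dim - 1) * stride + mask_dim
    loads_per_thread = math.ceil((in_tile * in_tile) / (block_dim * block_dim))
    return {
        "input_tile": in_tile,
        "output_tile": block_dim,
        "loads_per_thread": loads_per_thread,
    }


if __name__ == "__main__":
    # A 16x16 block with a 9x9 mask at stride 1
    print(input_tiling_geometry(16, 9))
    print(output_tiling_geometry(16, 9))
```

For a 16x16 block and a 9x9 mask at stride 1, input tiling yields an 8x8 output tile (a quarter of the threads compute results), while output tiling needs a 24x24 input tile, so each thread performs up to three shared-memory loads.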

**Program Structure:**
```
./program <input_folder_path> <output_folder_path> <batch_size> <mask_file> [stride]
```

**Technical Requirements:**
- Pad the input so that the output has the same dimensions as the input when stride = 1
- Process multiple images in batches (batch size provided as argument)
- Apply the mask to all three RGB channels
- (BONUS) Support variable stride values
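
The padding and stride requirements follow from the standard output-size formula for convolution. As a minimal sketch, assuming symmetric zero padding and an odd mask dimension (the helper names are hypothetical):

```python
def same_padding(mask_dim):
    """Padding per side that preserves the input size at stride 1 (odd masks only)."""
    if mask_dim % 2 == 0:
        raise ValueError("symmetric 'same' padding assumes an odd mask dimension")
    return (mask_dim - 1) // 2


def output_size(in_size, mask_dim, stride=1, padding=0):
    """Output length along one axis: floor((in + 2*pad - mask) / stride) + 1."""
    return (in_size + 2 * padding - mask_dim) // stride + 1


if __name__ == "__main__":
    pad = same_padding(9)  # 4 pixels on each side for a 9x9 mask
    for stride in (1, 2, 3):
        print(stride, output_size(512, 9, stride=stride, padding=pad))
```

With a 512-pixel axis and a 9x9 mask, this gives 512 outputs at stride 1, 256 at stride 2, and 171 at stride 3.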

**Mask File Format:**
- First line: dimension n (square mask)
- Next n lines: mask values (one row per line)
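
A reader for this format could look like the sketch below. It assumes the values on each row are separated by whitespace, which the format description above does not state explicitly; `parse_mask` is a hypothetical helper name.

```python
def parse_mask(text):
    """Parse a mask file: first line is n, followed by n rows of n values.

    Assumption: values within a row are whitespace-separated.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    rows = [[float(v) for v in ln.split()] for ln in lines[1:n + 1]]
    if len(rows) != n or any(len(r) != n for r in rows):
        raise ValueError(f"expected an {n}x{n} mask")
    return rows


if __name__ == "__main__":
    example = "3\n0 1 0\n1 -4 1\n0 1 0\n"
    for row in parse_mask(example):
        print(row)
```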

#### B) PyTorch Implementation

Create a Python equivalent using PyTorch's built-in convolution functions.
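
One possible shape for this, as a hedged sketch: it assumes the same 2D mask is applied to each RGB channel independently (a grouped convolution), which is one reading of "apply the mask to all three RGB channels". The function name `convolve_rgb_batch` is hypothetical. Note that `torch.nn.functional.conv2d` computes cross-correlation, so the mask is not flipped.

```python
import torch
import torch.nn.functional as F


def convolve_rgb_batch(images, mask, stride=1):
    """Apply one n x n mask to every channel of a batch of RGB images.

    images: float tensor of shape (N, 3, H, W)
    mask:   float tensor of shape (n, n), n odd
    Assumption: channels are filtered independently (groups=3).
    """
    n = mask.shape[0]
    # One copy of the mask per channel: weight shape (3, 1, n, n)
    weight = mask.to(images.dtype).expand(3, 1, n, n).contiguous()
    # padding = n // 2 keeps H and W unchanged when stride == 1
    return F.conv2d(images, weight, stride=stride, padding=n // 2, groups=3)


if __name__ == "__main__":
    batch = torch.rand(16, 3, 64, 64)
    blur = torch.full((9, 9), 1.0 / 81.0)
    print(convolve_rgb_batch(batch, blur).shape)
    print(convolve_rgb_batch(batch, blur, stride=3).shape)
```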

#### C) Performance Analysis

Conduct thorough performance profiling:

1. Compare execution times across implementations
2. Analyze the impact of declaring the mask as constant memory
3. Present results in well-organized tables
4. Prepare a comprehensive report explaining:
   - Performance comparisons between implementations
   - Analysis of memory optimizations
   - Observations about constant memory impact
   - Factors affecting convolution performance
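
For the results tables, a small formatter can keep the layout consistent across experiments. This is only a sketch: `timing_table` is a hypothetical helper, and the timings passed to it must come from your own measurements.

```python
def timing_table(timings_ms, baseline):
    """Render measured timings as a markdown table with speedup over a baseline.

    timings_ms: dict mapping implementation name -> measured time in milliseconds
    baseline:   key in timings_ms used as the speedup reference
    """
    base = timings_ms[baseline]
    lines = [
        f"| Implementation | Time (ms) | Speedup vs. {baseline} |",
        "|---|---|---|",
    ]
    for name, t in timings_ms.items():
        lines.append(f"| {name} | {t:.3f} | {base / t:.2f}x |")
    return "\n".join(lines)

# Usage (fill in measured values; none are provided here):
# print(timing_table({"Kernel 1": t1, "Kernel 2": t2, "Kernel 3": t3}, "Kernel 1"))
```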

#### Submission

Submit a complete report with experimental results, analysis, and code implementations.

In [None]:
def compile_and_run_kernel(kernel_number, input_folder, output_folder, batch_size, mask_file, stride=1, analytics=False):
    """
    Compile and run a CUDA kernel for 3D RGB image convolution.
    
    Args:
        kernel_number (int): Kernel implementation to use (1, 2, or 3)
        input_folder (str): Path to the folder containing input images
        output_folder (str): Path to the folder where processed images will be saved
        batch_size (int): Number of images to process in a batch
        mask_file (str): Path to the convolution mask file
        stride (int, optional): Stride value for convolution. Defaults to 1.
        analytics (bool, optional): Whether to run with NVIDIA profiler. Defaults to False.
    
    Returns:
        str: Path to the output folder
    """
    import os
    import time
    
    # Get current working directory
    cwd = os.getcwd()
    print(f"Current working directory: {cwd}")
    
    # Create paths
    kernel_src = os.path.join(cwd, f"cuda_kernels/kernel{kernel_number}.cu")
    kernel_exe = os.path.join(cwd, f"cuda_kernels/bin/kernel{kernel_number}.exe")
    
    # Ensure input and mask file paths are absolute
    input_folder_path = os.path.abspath(input_folder)
    output_folder_path = os.path.abspath(output_folder)
    mask_file_path = os.path.abspath(mask_file)
    
    # Create output folder if it doesn't exist
    os.makedirs(output_folder_path, exist_ok=True)
    
    # Create bin directory if it doesn't exist
    os.makedirs(os.path.dirname(kernel_exe), exist_ok=True)
    
    # Compile with appropriate flags
    print(f"Compiling kernel {kernel_number}...")
    !nvcc "{kernel_src}" -o "{kernel_exe}" --use_fast_math -O3
    
    # Print configuration details
    kernel_types = {
        1: "Basic implementation (no tiling)",
        2: "Tiling with block size matching input tile size",
        3: "Tiling with block size matching output tile size"
    }
    
    print(f"\nRunning Kernel {kernel_number}: {kernel_types.get(kernel_number, 'Unknown')}")
    print(f"Batch size: {batch_size}")
    print(f"Stride: {stride}")
    print(f"Input folder: {input_folder_path}")
    print(f"Output folder: {output_folder_path}")
    print(f"Mask file: {mask_file_path}")
    
    # Start timing
    start_time = time.time()
    
    # Run with analytics if requested, otherwise run normally
    if analytics:
        # Create analytics_Bin directory if it doesn't exist
        analytics_dir = os.path.join(cwd, "analytics_Bin")
        os.makedirs(analytics_dir, exist_ok=True)
        
        # Set profile output path inside analytics_Bin folder
        timestamp = time.strftime("%Y%m%d-%H%M%S")
        profile_output = os.path.join(analytics_dir, f"profile_k{kernel_number}_b{batch_size}_s{stride}_{timestamp}")
        
        # Run with nsys profiling
        !nsys profile --sample=none --trace=cuda --force-overwrite=true --stats=true --output="{profile_output}" "{kernel_exe}" "{input_folder_path}" "{output_folder_path}" "{batch_size}" "{mask_file_path}" "{stride}"
        print(f"Analytics data saved to {profile_output}")
    else:
        # Run normally
        !"{kernel_exe}" "{input_folder_path}" "{output_folder_path}" "{batch_size}" "{mask_file_path}" "{stride}"
    
    # End timing
    end_time = time.time()
    execution_time = end_time - start_time
    
    print(f"\nExecution completed in {execution_time:.4f} seconds")
    return output_folder_path

# Example usage:
# output = compile_and_run_kernel(1, "input_images", "output_images", 16, "masks/mask5x5.txt")
# output = compile_and_run_kernel(2, "input_images", "output_images", 16, "masks/mask5x5.txt", stride=2, analytics=True)
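
The `!` shell escapes above only work inside IPython/Jupyter. Outside a notebook, the same compile and run steps could be expressed with `subprocess`, as in the sketch below. `build_commands` and `compile_and_run` are hypothetical names; the paths and compiler flags mirror those used in `compile_and_run_kernel`.

```python
import os
import subprocess


def build_commands(kernel_number, input_folder, output_folder,
                   batch_size, mask_file, stride=1, root="."):
    """Return (compile_cmd, run_cmd) as argument lists for subprocess.run."""
    src = os.path.join(root, "cuda_kernels", f"kernel{kernel_number}.cu")
    exe = os.path.join(root, "cuda_kernels", "bin", f"kernel{kernel_number}.exe")
    compile_cmd = ["nvcc", src, "-o", exe, "--use_fast_math", "-O3"]
    run_cmd = [
        exe,
        os.path.abspath(input_folder),
        os.path.abspath(output_folder),
        str(batch_size),
        os.path.abspath(mask_file),
        str(stride),
    ]
    return compile_cmd, run_cmd


def compile_and_run(*args, **kwargs):
    """Compile, then run; raises CalledProcessError if either step fails."""
    compile_cmd, run_cmd = build_commands(*args, **kwargs)
    os.makedirs(os.path.dirname(compile_cmd[3]), exist_ok=True)
    subprocess.run(compile_cmd, check=True)
    subprocess.run(run_cmd, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems with paths that contain spaces.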

In [None]:
import numpy as np
import os
import time
import cv2
from pathlib import Path

def compare_image_outputs(reference_folder, output_folder, tolerance=1e-5, verbose=False):
    """
    Compare the output RGB images to check if they match within tolerance.
    
    Parameters:
    - reference_folder: Path to folder containing reference/expected RGB images
    - output_folder: Path to folder containing output RGB images from your implementation
    - tolerance: Maximum allowed difference between corresponding pixel values
    - verbose: Whether to print details about the comparison
    
    Returns:
    - True if images match within tolerance, False otherwise
    """
    # Ensure paths are Path objects
    ref_path = Path(reference_folder)
    out_path = Path(output_folder)
    
    # Get all image files in reference folder (supporting multiple formats)
    ref_files = []
    for ext in ['*.png', '*.jpg', '*.jpeg']:
        ref_files.extend(ref_path.glob(ext))
    ref_files = sorted([f for f in ref_files if f.is_file()])
    
    if not ref_files:
        print(f"❌ FAIL: No reference images found in {reference_folder}")
        return False
    
    print(f"Found {len(ref_files)} reference images to compare.")
    
    # Track overall results
    all_match = True
    total_images = 0
    matched_images = 0
    
    # Statistics collection
    max_differences = []
    mean_differences = []
    
    for ref_file in ref_files:
        # Count every reference image, including ones whose output is
        # missing or unreadable, so the final ratio reflects all of them
        total_images += 1
        
        # Construct path to corresponding output file
        out_file = out_path / ref_file.name
        
        # Check if output file exists
        if not out_file.exists():
            print(f"❌ FAIL: Missing output file {out_file}")
            all_match = False
            continue
        
        # Read images
        ref_img = cv2.imread(str(ref_file))
        out_img = cv2.imread(str(out_file))
        
        # Basic validation
        if ref_img is None or out_img is None:
            print(f"❌ FAIL: Could not read images for {ref_file.name}")
            all_match = False
            continue
            
        if ref_img.shape != out_img.shape:
            print(f"❌ FAIL: Image dimensions don't match for {ref_file.name}. " 
                  f"Expected {ref_img.shape}, got {out_img.shape}")
            all_match = False
            continue
        
        # Calculate absolute differences across all channels
        diff = np.abs(ref_img.astype(np.float32) - out_img.astype(np.float32))
        max_diff = np.max(diff)
        mean_diff = np.mean(diff)
        
        max_differences.append(max_diff)
        mean_differences.append(mean_diff)
        
        # Check if values match within tolerance
        image_match = np.all(diff <= tolerance)
        
        if image_match:
            matched_images += 1
            if verbose:
                print(f"✅ PASS: {ref_file.name} - Max diff: {max_diff:.4f}, Mean diff: {mean_diff:.4f}")
        else:
            all_match = False
            print(f"❌ FAIL: {ref_file.name} - Max diff: {max_diff:.4f}, Mean diff: {mean_diff:.4f}")
            
            if verbose:
                # Find positions of largest differences
                max_pos = np.unravel_index(np.argmax(diff), diff.shape)
                print(f"  - Largest difference at position {max_pos}: {max_diff:.4f}")
                
                # Count pixels exceeding tolerance
                exceed_count = np.sum(diff > tolerance)
                exceed_percent = 100.0 * exceed_count / diff.size
                print(f"  - {exceed_count} pixels ({exceed_percent:.2f}%) exceed tolerance")
    
    # Print overall results
    if all_match:
        print(f"✅ ALL PASS: All {total_images} images match within tolerance {tolerance}")
    else:
        print(f"⚠️ PARTIAL MATCH: {matched_images}/{total_images} images matched within tolerance {tolerance}")
    
    # Print statistics if we had valid comparisons
    if max_differences:
        print("Overall statistics:")
        print(f"  - Maximum difference across all images: {max(max_differences):.4f}")
        print(f"  - Average difference across all images: {np.mean(mean_differences):.4f}")
    
    return all_match

def verify_batch_processing(kernel_num, reference_folder, output_folder, 
                            stride=1, batch_size=1, tolerance=20):
    """
    Verify the output of a 3D RGB image convolution kernel against reference images.
    
    Parameters:
    - kernel_num: Kernel number (1, 2, or 3)
    - reference_folder: Folder containing reference output images
    - output_folder: Folder containing generated output images
    - stride: Stride value used for convolution
    - batch_size: Batch size used for processing
    - tolerance: Maximum allowed difference between corresponding pixel values
    """
    print(f"\nVerifying Kernel {kernel_num} output (stride={stride}, batch_size={batch_size}):")
    print("Comparing images in:")
    print(f"  - Reference: {reference_folder}")
    print(f"  - Output:    {output_folder}")
    
    # Run comparison
    start_time = time.time()
    result = compare_image_outputs(reference_folder, output_folder, 
                                  tolerance=tolerance, verbose=True)
    end_time = time.time()
    
    print(f"Verification completed in {end_time - start_time:.2f} seconds")
    
    if result:
        print(f"✅ Kernel {kernel_num} (stride={stride}, batch_size={batch_size}) "
              f"PASSED verification")
    else:
        print(f"❌ Kernel {kernel_num} (stride={stride}, batch_size={batch_size}) "
              f"FAILED verification")
    
    return result

# Example usage:
# verify_batch_processing(1, "reference_images/stride1", "output_images/kernel1", stride=1, batch_size=16)

### **Requirement - 1**
- Kernel 1 must not use tiling


In [105]:
# Kernel 1 (no tiling): 9x9 blur mask, batch size 16, stride 3
output = compile_and_run_kernel(1, "Input_TestCases/input_images", "Output_TestCases/output_images", 16, "Input_TestCases/masks/mask9x9_blur.txt", stride=3)
# verify_batch_processing(1, "Output_TestCases/reference_images", "Output_TestCases/output_images", stride=3, batch_size=16)



Current working directory: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution
Compiling kernel 1...
kernel1.cu
tmpxft_0000503c_00000000-10_kernel1.cudafe1.cpp
   Creating library e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\cuda_kernels\bin\kernel1.lib and object e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\cuda_kernels\bin\kernel1.exp

Running Kernel 1: Basic implementation (no tiling)
Batch size: 16
Stride: 3
Input folder: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\Input_TestCases\input_images
Output folder: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel

### **Requirement - 2**
- Kernel 2 must use input tiling (block size matches the input tile size)


In [None]:
# Kernel 2 (input tiling): 9x9 blur mask, batch size 16, stride 1
output = compile_and_run_kernel(2, "Input_TestCases/input_images", "Output_TestCases/output_images", 16, "Input_TestCases/masks/mask9x9_blur.txt", stride=1)

# verify_batch_processing(2, "Output_TestCases/reference_images", "Output_TestCases/output_images", stride=1, batch_size=16)



Current working directory: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution
Compiling kernel 2...
kernel2.cu
tmpxft_0000135c_00000000-10_kernel2.cudafe1.cpp
   Creating library e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\cuda_kernels\bin\kernel2.lib and object e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\cuda_kernels\bin\kernel2.exp

Running Kernel 2: Tiling with block size matching input tile size
Batch size: 16
Stride: 3
Input folder: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_5\Solution\Input_TestCases\input_images
Output folder: e:\02_Learn\01_University\Senior-1 Spring\C

### **Requirement - 3**
- Kernel 3 must use output tiling (block size matches the output tile size)


In [None]:
# Kernel 3 (output tiling): 9x9 blur mask, batch size 16, stride 2
output = compile_and_run_kernel(3, "Input_TestCases/input_images", "Output_TestCases/output_images", 16, "Input_TestCases/masks/mask9x9_blur.txt", stride=2)

# verify_batch_processing(3, "Output_TestCases/reference_images", "Output_TestCases/output_images", stride=2, batch_size=16)


### **PyTorch Implementation**

In [None]:
!python PyTorch_Implementation.py Input_TestCases/input_images Output_TestCases/reference_images 16 Input_TestCases/masks/mask9x9_blur.txt --stride 3