# 1D Convolution Kernels and Report

## Task Overview
You are required to write three CUDA kernels that perform 1D convolution, along with a report analyzing their performance.

### Kernel Implementations:
1. **Kernel 1**: No tiling.
2. **Kernel 2**: Uses output tiling.
3. **Kernel 3**: Uses input tiling.
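
The list above names the tiling strategies without defining them. The following is a rough Python sketch of the index ranges each block would cover, assuming the common textbook definitions: with output tiling, a block has one thread per output element and cooperatively loads the extra halo inputs; with input tiling, a block has one thread per loaded input element and only the inner threads produce outputs. The function `tile_plan` and its parameters are hypothetical names used for illustration, not part of the assignment.

```python
def tile_plan(n, block_dim, mask_width, scheme):
    """Return per-block (out_start, out_end, in_start, in_end) ranges.

    All ends are exclusive. Input ranges may extend below 0 or beyond n;
    those positions are ghost cells that a kernel would treat as padding.
    Assumes an odd mask width so the mask has a well-defined centre.
    """
    radius = mask_width // 2
    if scheme == "output":
        # One thread per output element; the block loads extra halo inputs.
        out_tile = block_dim
    elif scheme == "input":
        # One thread per loaded input; only the inner threads write outputs.
        out_tile = block_dim - 2 * radius
        if out_tile <= 0:
            raise ValueError("block_dim must exceed 2 * radius for input tiling")
    else:
        raise ValueError("scheme must be 'output' or 'input'")

    num_blocks = (n + out_tile - 1) // out_tile  # ceiling division
    plan = []
    for block in range(num_blocks):
        out_start = block * out_tile
        out_end = min(out_start + out_tile, n)
        in_start = out_start - radius
        in_end = out_start + out_tile + radius
        plan.append((out_start, out_end, in_start, in_end))
    return plan


if __name__ == "__main__":
    for scheme in ("output", "input"):
        print(scheme, tile_plan(10, 4, 3, scheme))
```

Under these assumptions, the same block size yields fewer outputs per block with input tiling, so more blocks are needed to cover the vector.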

## Report Requirements
The report should compare the performance of the three kernels, highlighting differences in execution time and efficiency.

## Command-Line Arguments
Each kernel executable should accept three command-line arguments, in this order:
- **Input file** (vector data)
- **Mask file** (convolution mask)
- **Output file** (resulting vector)

### Example Usage:
```bash
./kernel1 inputfile.txt mask.txt outputfile.txt
```

## Input File Format
The input file contains:
1. An integer `N` (the number of elements in the vector).
2. `N` floating-point numbers representing the vector.

**Example:**
```
5
1 2 3 4 5
```

## Mask File Format
The mask file contains:
1. An integer `M` (the number of elements in the mask).
2. `M` floating-point numbers representing the mask.

**Example:**
```
3
0 1 0
```
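
The input and mask files share the same layout: a count followed by that many values. A minimal Python sketch of a parser for this layout could be used to sanity-check test files before handing them to a kernel. The name `parse_vector_text` is hypothetical, and the sketch assumes values are separated by arbitrary whitespace, as in the examples.

```python
def parse_vector_text(text):
    """Parse 'count followed by count floats' and return the floats as a list.

    Assumes whitespace-separated tokens (spaces or newlines), matching the
    examples in this document. Raises ValueError on a count mismatch.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("file is empty")
    count = int(tokens[0])
    values = [float(token) for token in tokens[1:]]
    if len(values) != count:
        raise ValueError(f"expected {count} values, found {len(values)}")
    return values


if __name__ == "__main__":
    print(parse_vector_text("5\n1 2 3 4 5\n"))
    print(parse_vector_text("3\n0 1 0\n"))
```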

## Output File Format
The output file should contain `N` floating-point numbers representing the resulting vector after convolution.

**Example Output:**
```
1 2 3 4 5
```
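
The example output can be reproduced with a small CPU reference, which is also handy for checking kernel results. This Python sketch makes assumptions that the task text does not state: the mask is centred on each output element, it is applied without flipping, and elements outside the vector count as zero. The names `convolve_1d_reference` and `format_vector` are hypothetical, and the output number formatting is a guess based on the example.

```python
def convolve_1d_reference(vector, mask):
    """Centred 1D convolution with zero padding (assumed boundary rule)."""
    radius = len(mask) // 2
    n = len(vector)
    result = []
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(mask):
            k = i - radius + j  # input index aligned with mask element j
            if 0 <= k < n:      # out-of-range inputs contribute nothing
                acc += vector[k] * weight
        result.append(acc)
    return result


def format_vector(values):
    """Join values with single spaces, dropping trailing zeros."""
    return " ".join(f"{value:g}" for value in values)


if __name__ == "__main__":
    vector = [1.0, 2.0, 3.0, 4.0, 5.0]
    mask = [0.0, 1.0, 0.0]
    print(format_vector(convolve_1d_reference(vector, mask)))
```

With the identity-like mask `0 1 0`, every output equals the corresponding input, which matches the example above.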

## Submission
Submit a compressed archive named after your student code. The archive should contain:
- The three CUDA kernel implementations.
- The report analyzing their performance.

---

In [11]:
```python
import os


def compile_and_run_kernel(kernel_number, input_file_name, mask_file_name, analytics=False):
    """Compile kernel{kernel_number}.cu with nvcc, run it, and return the output file path.

    Uses IPython shell escapes (`!`), so it must run inside a notebook cell.
    Set analytics=True to run the kernel under `nsys profile`.
    """
    # Get current working directory
    cwd = os.getcwd()
    print(f"Current working directory: {cwd}")

    # Source and executable paths
    kernel_src = os.path.join(cwd, f"cuda_kernels/kernel{kernel_number}.cu")
    kernel_exe = os.path.join(cwd, f"cuda_kernels/bin/kernel{kernel_number}.exe")

    # Input and mask file paths (names are given without the .txt extension)
    input_file = os.path.join(cwd, f"Input_TestCases/{input_file_name}.txt")
    mask_file = os.path.join(cwd, f"Input_TestCases/{mask_file_name}.txt")

    # Output file path
    output_file = os.path.join(cwd, f"Output_TestCases/{input_file_name}_mask{mask_file_name}_k{kernel_number}_o.txt")

    # Create the bin and output directories if they don't exist
    os.makedirs(os.path.dirname(kernel_exe), exist_ok=True)
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # Compile
    !nvcc "{kernel_src}" -o "{kernel_exe}"

    # Run with analytics if requested, otherwise run normally
    if analytics:
        # Create analytics_Bin directory if it doesn't exist
        analytics_dir = os.path.join(cwd, "analytics_Bin")
        os.makedirs(analytics_dir, exist_ok=True)

        # Set profile output path inside analytics_Bin folder
        profile_output = os.path.join(analytics_dir, f"profile_k{kernel_number}_{input_file_name}_mask{mask_file_name}")

        # Run with nsys profiling
        !nsys profile --sample=none --trace=cuda --force-overwrite=true --stats=true --output="{profile_output}" "{kernel_exe}" "{input_file}" "{mask_file}" "{output_file}"
        print(f"Analytics data saved to {profile_output}")
    else:
        # Run normally
        !"{kernel_exe}" "{input_file}" "{mask_file}" "{output_file}"

    return output_file

# Example usage:
# output = compile_and_run_kernel(1, "vector1", "mask1")
# output = compile_and_run_kernel(2, "vector2", "mask2", analytics=True)
```

### **Requirement - 1**
- Use only 1 block for your kernel and let the CPU handle the final sum.


In [12]:
```python
compile_and_run_kernel(1, "t1", "m1", analytics=False)
```

```
Current working directory: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution
kernel1.cu
tmpxft_00003db0_00000000-10_kernel1.cudafe1.cpp
   Creating library e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel1.lib and object e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel1.exp
Kernel execution time: 193.121000 ms

'e:\\02_Learn\\01_University\\Senior-1 Spring\\Current\\Parallel Computing\\Labs\\Lab_4\\Solution\\Output_TestCases/t1_maskm1_k1_o.txt'
```

### **Requirement - 2**
- Use only 1 block for your kernel and let one thread handle the final sum.


In [13]:
```python
compile_and_run_kernel(2, "t1", "m1", analytics=False)
```

```
Current working directory: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution
kernel2.cu
tmpxft_000064f4_00000000-10_kernel2.cudafe1.cpp
   Creating library e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel2.lib and object e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel2.exp
Kernel execution time: 48.191000 ms

'e:\\02_Learn\\01_University\\Senior-1 Spring\\Current\\Parallel Computing\\Labs\\Lab_4\\Solution\\Output_TestCases/t1_maskm1_k2_o.txt'
```

### **Requirement - 3**
- Use multiple blocks for your kernel and let the CPU handle the final sum.


In [14]:
```python
compile_and_run_kernel(3, "t1", "m1", analytics=False)
```

```
Current working directory: e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution
      int halo_right_idx = (blockIdx.x + 1) * blockDim.x + maskRadius - 1;
          ^


      int shared_mem_size = tile_size + 2 * maskRadius;
          ^

kernel3.cu
tmpxft_00003b90_00000000-10_kernel3.cudafe1.cpp
   Creating library e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel3.lib and object e:\02_Learn\01_University\Senior-1 Spring\Current\Parallel Computing\Labs\Lab_4\Solution\cuda_kernels\bin\kernel3.exp
Kernel execution time: 62.043000 ms

'e:\\02_Learn\\01_University\\Senior-1 Spring\\Current\\Parallel Computing\\Labs\\Lab_4\\Solution\\Output_TestCases/t1_maskm1_k3_o.txt'
```
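
For the report, the `Kernel execution time` lines printed by the three runs can be turned into relative speedups. The sketch below is one possible way to do that in Python; `parse_kernel_time_ms` and `speedups` are hypothetical helper names, and the log strings are copied from the single runs recorded above. A fair comparison would likely average several runs per kernel rather than rely on one measurement each.

```python
import re

# Matches lines such as "Kernel execution time: 193.121000 ms".
TIME_PATTERN = re.compile(r"Kernel execution time:\s*([0-9]+(?:\.[0-9]+)?)\s*ms")


def parse_kernel_time_ms(log_text):
    """Extract the first reported kernel time (in milliseconds) from run output."""
    match = TIME_PATTERN.search(log_text)
    if match is None:
        raise ValueError("no 'Kernel execution time' line found")
    return float(match.group(1))


def speedups(times_ms, baseline):
    """Return baseline_time / kernel_time for every kernel in times_ms."""
    base = times_ms[baseline]
    return {kernel: base / elapsed for kernel, elapsed in times_ms.items()}


if __name__ == "__main__":
    logs = {
        1: "Kernel execution time: 193.121000 ms",
        2: "Kernel execution time: 48.191000 ms",
        3: "Kernel execution time: 62.043000 ms",
    }
    times = {kernel: parse_kernel_time_ms(text) for kernel, text in logs.items()}
    for kernel, factor in speedups(times, baseline=1).items():
        print(f"kernel{kernel}: {times[kernel]:.3f} ms, {factor:.2f}x relative to kernel1")
```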