## **Decompressing KMA HSR Composite Rainfall Data**

The **HSR (Hybrid Surface Rainfall) composite rainfall data** provided by the Korea Meteorological Administration (KMA) is distributed as `.tar.gz` archives containing data at 5-minute intervals. Because the dataset is large, the archives are extracted in parallel to shorten the decompression time.

#### Overview
- **Data Type**: KMA HSR composite rainfall data
- **File Format**: `.tar.gz`
- **Time Interval**: 5-minute interval data
- **Processing Method**: Parallel processing (using the Joblib library for multi-core CPU utilization)
- **Output**: Extracted raw data stored in the specified directory (`raw_data`)
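
As a rough sanity check on the 5-minute interval, one full day should contain 24 × 12 = 288 time steps. The sketch below enumerates them. `expected_stamps` is a hypothetical helper, and the `YYYYMMDDHHMM` timestamp format is an assumption based on the file names that appear in the processing log later in this notebook.

```python
from datetime import datetime, timedelta

def expected_stamps(day, step_minutes=5):
    """Yield YYYYMMDDHHMM strings for every time step in one calendar day."""
    t = datetime.strptime(day, "%Y%m%d")
    end = t + timedelta(days=1)
    while t < end:
        yield t.strftime("%Y%m%d%H%M")
        t += timedelta(minutes=step_minutes)

stamps = list(expected_stamps("20220609"))
print(len(stamps), stamps[0], stamps[-1])  # 288 202206090000 202206092355
```

Comparing such a list with the timestamps found after extraction is one way to spot missing time steps.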

In [None]:
import os
import gzip
import tarfile
import numpy as np
from PIL import Image
from tqdm import tqdm
from joblib import Parallel, delayed

In [13]:
def extract_single_tar_gz(file_path, output_dir):
    """
    Extracts a single .tar.gz file into the specified output directory.
    """
    print(f'Extracting {file_path}...')

    # Open the tar.gz file and extract its contents
    with tarfile.open(file_path, 'r:gz') as tar:
        def reset_members(tarinfo, path=None):
            # Override the archived permission bits with rwxr-xr-x for every member
            tarinfo.mode = 0o755
            return tarinfo

        # Extract all members directly into the output directory.
        # Note: the `filter` argument needs Python 3.12 or a patched earlier release.
        tar.extractall(path=output_dir, members=None, filter=reset_members)
        print(f'Extracted {file_path} into {output_dir}')

def extract_tar_gz(input_dir, output_dir, n_jobs=-1):
    """
    Extracts all .tar.gz files in the input directory into the specified output directory using joblib for parallelism.
    """
    # Create the output directory if it does not already exist
    os.makedirs(output_dir, exist_ok=True)

    # Get all .tar.gz files in the input directory
    tar_gz_files = [os.path.join(input_dir, filename) for filename in os.listdir(input_dir) if filename.endswith(".tar.gz")]

    # Use joblib to parallelize the extraction process
    Parallel(n_jobs=n_jobs)(delayed(extract_single_tar_gz)(file_path, output_dir) for file_path in tar_gz_files)

# Example usage:
input_dir = "/home/sehoon/Desktop/측량학회/Code/TODO/"
output_dir = "/home/sehoon/Desktop/측량학회/0919/"

# Set n_jobs to -1 to use all available CPU cores
extract_tar_gz(input_dir, output_dir, n_jobs=-1)

Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202007.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202209.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202210.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202305.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202001.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202308.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202012.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202309.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202201.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202108.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202208.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202211.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202302.tar.gz...
Extracting /home/sehoon/Desktop/측량학회/Code/TODO/RDR_HSR_202205.ta

## **Raw Data to Cropped Images (BIN -> 256x256 Pixels PNG)**

This script processes raw `.bin.gz` files, extracts reflectivity data, crops a specified region of interest (ROI), and saves the result as 16-bit PNG images. The process leverages parallel processing to speed up the conversion of large datasets.

#### 1. **Read the `.gz` file**
- The file is opened using `gzip.open`, and raw binary data is read.
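
A minimal sketch of this step, run on a throwaway file instead of real HSR data. `read_gz_bytes` is a hypothetical helper name, not part of the notebook.

```python
import gzip
import os
import tempfile

def read_gz_bytes(path):
    """Return the decompressed contents of a gzip file as bytes."""
    with gzip.open(path, "rb") as f:
        return f.read()

# Round-trip demo: write a small gzip file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin.gz")
    payload = bytes(range(16))
    with gzip.open(demo_path, "wb") as f:
        f.write(payload)
    restored = read_gz_bytes(demo_path)

print(restored == payload)  # True
```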

#### 2. **Extract and reshape reflectivity data**
- The dBZ data is extracted by skipping the first 1024 bytes of the file and reading the remaining bytes as 16-bit signed integers (`np.int16`).
- It is reshaped into a 2D array of shape `(2881, 2305)` (rows, columns).
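
The decoding can be sketched on a synthetic buffer as follows. `decode_grid` and the constant names are hypothetical, and the native byte order of `np.int16` is assumed, as in the notebook code.

```python
import numpy as np

SKIP_BYTES = 1024            # leading bytes skipped, as stated above
N_ROWS, N_COLS = 2881, 2305  # grid shape, as stated above

def decode_grid(data):
    """Skip the leading bytes and view the rest as a (rows, cols) int16 grid."""
    grid = np.frombuffer(data, dtype=np.int16, offset=SKIP_BYTES)
    return grid.reshape((N_ROWS, N_COLS))

# Synthetic buffer: 1024 zero bytes followed by a zero-filled grid
fake = bytes(SKIP_BYTES) + np.zeros(N_ROWS * N_COLS, dtype=np.int16).tobytes()
grid = decode_grid(fake)

expected_size = SKIP_BYTES + N_ROWS * N_COLS * 2  # 2 bytes per int16 value
print(grid.shape, expected_size)  # (2881, 2305) 13282434
```

Because the reshape only succeeds when the element count matches exactly, a decompressed file of any other size raises a `ValueError` at this step.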

#### 3. **Quality control of dBZ values**
- dBZ values less than `-1000` are clamped to `-1000`.
- All values are shifted by adding `+1000` to ensure they are positive.
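
A small worked example of the clamp-and-shift step. `clamp_and_shift` is a hypothetical helper, and the sample values are made up for illustration.

```python
import numpy as np

FLOOR = -1000  # lower bound stated above

def clamp_and_shift(dbz):
    """Clamp values below FLOOR, then shift so that FLOOR maps to 0."""
    clamped = np.maximum(dbz, FLOOR)
    return clamped - FLOOR

sample = np.array([-30000, -1000, -250, 0, 5500], dtype=np.int16)
shifted = clamp_and_shift(sample)
print(shifted.tolist())  # [0, 0, 750, 1000, 6500]
```

After this step every value is non-negative, which is what allows the array to be stored as an unsigned 16-bit image later on.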

#### 4. **Flip the image vertically**
- The reflectivity array is flipped upside down using `np.flipud` to match the coordinate system.
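
The effect of `np.flipud` on row indices, shown on a toy array:

```python
import numpy as np

grid = np.arange(6).reshape(3, 2)   # rows: [0, 1], [2, 3], [4, 5]
flipped = np.flipud(grid)           # rows: [4, 5], [2, 3], [0, 1]

# Row r of the flipped array is row (n_rows - 1 - r) of the original
n_rows = grid.shape[0]
same_row = np.array_equal(flipped[0], grid[n_rows - 1 - 0])
print(flipped.tolist(), same_row)  # [[4, 5], [2, 3], [0, 1]] True
```

Any row index used after this step therefore refers to the flipped array, not to the row order in the file.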

#### 5. **Crop the region of interest (ROI)**
- A `256x256` pixel region is cropped, with its top-left corner at the specified target index (`target_idx`, given as row and column in the flipped array). In this study, the ROI was set to cover **Seoul**.
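
A sketch of the crop with an explicit bounds check. `crop_roi` is a hypothetical helper. The check is an addition for illustration, since plain slicing silently returns a smaller array when the window runs past the edge.

```python
import numpy as np

def crop_roi(grid, target_idx, size=256):
    """Return a size x size window whose top-left corner is target_idx (row, col)."""
    row, col = target_idx
    if row < 0 or col < 0 or row + size > grid.shape[0] or col + size > grid.shape[1]:
        raise ValueError("ROI extends beyond the grid")
    return grid[row:row + size, col:col + size]

grid = np.zeros((2881, 2305), dtype=np.uint16)
roi = crop_roi(grid, (1439, 1214))  # target index used in this notebook
print(roi.shape)  # (256, 256)
```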

#### 6. **Save as 16-bit PNG**
- The cropped array is converted to a 16-bit image using `PIL.Image`.
- The output directory structure is created based on the timestamp in the filename.
- The image is saved with a `.png` extension.
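
The output path and the 16-bit save can be sketched as below. `output_path` is a hypothetical helper, and it assumes file names that end in `<YYYYMMDDHHMM>.bin.gz`, as in the processing log further down.

```python
import os
import tempfile
import numpy as np
from PIL import Image

def output_path(dst_dir, gz_name):
    """Build <dst_dir>/<YYYYMM>/<DD>/<YYYYMMDDHHMM>.png from the file name."""
    stamp = os.path.basename(gz_name)[-19:-7]  # 12-digit timestamp before ".bin.gz"
    return os.path.join(dst_dir, stamp[:6], stamp[6:8], stamp + ".png")

with tempfile.TemporaryDirectory() as tmp:
    path = output_path(tmp, "RDR_CMP_HSR_PUB_202206091115.bin.gz")
    os.makedirs(os.path.dirname(path), exist_ok=True)

    # A uint16 array is written as a 16-bit grayscale PNG
    tile = np.full((256, 256), 1000, dtype=np.uint16)
    Image.fromarray(tile).save(path)

    # Read the file back to confirm the values survive the round trip
    with Image.open(path) as img:
        restored = np.array(img)
    rel = os.path.relpath(path, tmp)

print(rel, int(restored.max()))
```

Values above 255 survive the round trip, which is the reason for choosing a 16-bit rather than an 8-bit image.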

In [5]:
def list_files(src_dir, file_format):
    """
    Retrieve a sorted array of file paths in a directory and its subdirectories, filtered by a specified file format.
    """
    file_names = []
    for (root, directories, files) in os.walk(src_dir):
        for file in files:
            if file.endswith(file_format):
                img_dir = os.path.join(root, file)
                file_names.append(img_dir)

    return np.array(sorted(file_names))

def process_file(gz_path, src_dir, dst_dir, target_idx=(1439, 1214), size=256):
    try:
        # Read the .gz file. Paths from list_files already include src_dir,
        # so joining src_dir again would break when src_dir is a relative path.
        with gzip.open(gz_path, 'rb') as f:
            data = f.read()

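        # Top-left corner (row, column) of the ROI in the flipped array
        n_row_target, n_col_target = target_idx

        # Skip the first 1024 bytes, read the rest as int16, and reshape to (rows, columns).
        # Values are kept as stored (no division by 100).
        dBZ = np.frombuffer(data[1024:], dtype=np.int16)
        dBZ = dBZ.reshape((2881, 2305)).copy()  # frombuffer gives a read-only view; copy to allow modification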

        # QC Reflectivity: Set dBZ values less than -1000 to -1000
        dBZ[dBZ < -1000] = -1000

        # Add +1000 to all dBZ values to ensure they are positive
        dBZ = dBZ + 1000

        # Flip the image vertically (reverse the coordinate system)
        dBZ = np.flipud(dBZ)

        # Crop the ROI (Region of Interest) with a size of 256x256 pixels
        dBZ_cropped = dBZ[n_row_target: n_row_target + size, n_col_target: n_col_target + size]

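        # Save the cropped region as a 16-bit PNG image
        img = Image.fromarray(dBZ_cropped.astype(np.uint16))  # Convert to 16-bit before saving
        # File names end in <YYYYMMDDHHMM>.bin.gz, so the slices below give YYYYMM, DD, and the full timestamp
        img_dir = os.path.join(dst_dir, gz_path[-19:-13], gz_path[-13:-11])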
        os.makedirs(img_dir, exist_ok=True)
        img.save(os.path.join(img_dir, gz_path[-19:-7] + '.png'))

    except Exception as e:
        print(f"An error occurred for {gz_path}, {e}")
        return gz_path  # Return the file path if an error occurred during processing

    return None  # Indicate successful processing

def img_cropping_and_saving(src_dir, dst_dir):
    input_names = list_files(src_dir, '.bin.gz')
    error_names = []

    # Use all available CPU cores
    cpu_cores = os.cpu_count()

    # Execute parallel processing
    results = Parallel(n_jobs=cpu_cores)(
        delayed(process_file)(gz_path, src_dir, dst_dir) for gz_path in tqdm(input_names, desc="Processing files")
    )

    # Track files that encountered errors during processing
    error_names = [gz_path for gz_path in results if gz_path is not None]

    return error_names

In [15]:
src_dir = "/home/sehoon/Desktop/측량학회/HSR/"
dst_dir = "/home/sehoon/Desktop/측량학회/dBZ_png/"
target_idx = (1439, 1214)  # Matches the process_file default; not passed to img_cropping_and_saving below

error_names = img_cropping_and_saving(src_dir, dst_dir)

Processing files:  65%|██████▌   | 256512/394216 [13:20<07:24, 310.11it/s]

An error occurred for /home/sehoon/Desktop/측량학회/HSR/RDR_HSR_202206/09/RDR_CMP_HSR_PUB_202206091115.bin.gz, Error -3 while decompressing data: invalid literal/length code


Processing files:  65%|██████▌   | 256640/394216 [13:21<07:09, 320.33it/s]

An error occurred for /home/sehoon/Desktop/측량학회/HSR/RDR_HSR_202206/09/RDR_CMP_HSR_PUB_202206092005.bin.gz, CRC check failed 0xcca44147 != 0x53a9f75d


Processing files:  66%|██████▌   | 258368/394216 [13:27<08:20, 271.67it/s]

An error occurred for /home/sehoon/Desktop/측량학회/HSR/RDR_HSR_202206/15/RDR_CMP_HSR_PUB_202206152015.bin.gz, Not a gzipped file (b'\x00\x00')


Processing files:  66%|██████▌   | 259328/394216 [13:30<07:26, 301.86it/s]

An error occurred for /home/sehoon/Desktop/측량학회/HSR/RDR_HSR_202206/19/RDR_CMP_HSR_PUB_202206190300.bin.gz, Error -3 while decompressing data: invalid literal/length code


Processing files:  66%|██████▌   | 259520/394216 [13:31<07:22, 304.21it/s]

An error occurred for /home/sehoon/Desktop/측량학회/HSR/RDR_HSR_202206/19/RDR_CMP_HSR_PUB_202206192110.bin.gz, Not a gzipped file (b'\x00\x00')


Processing files: 100%|██████████| 394216/394216 [21:30<00:00, 305.43it/s]
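
The messages in the log above (`Not a gzipped file`, `CRC check failed`, `Error -3 while decompressing data`) all point to damaged `.bin.gz` files. As a minimal sketch, such files could be screened before conversion. `is_readable_gzip` is a hypothetical helper, and the demo files are synthetic.

```python
import gzip
import os
import tempfile
import zlib

def is_readable_gzip(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile (bad magic bytes, CRC mismatch),
        # EOFError covers truncated files, zlib.error covers corrupt streams
        return False

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.bin.gz")
    bad = os.path.join(tmp, "bad.bin.gz")
    with gzip.open(good, "wb") as f:
        f.write(b"payload")
    with open(bad, "wb") as f:
        f.write(b"\x00\x00 not gzip data")
    good_ok = is_readable_gzip(good)
    bad_ok = is_readable_gzip(bad)

print(good_ok, bad_ok)  # True False
```

The list returned by `img_cropping_and_saving` already records the failed paths, so a check like this is mainly useful for deciding which files to download again.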
