## Qualification Tool for the RAPIDS Accelerator for Apache Spark

To run the qualification tool, set `EVENTLOG_PATH` to the location of your Spark CPU event logs, then select "Run all" to execute the notebook. When the notebook completes, the output tables appear below. For more options on running the qualification tool, refer to the [Qualification Tool User Guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/qualification/quickstart.html#running-the-tool).

### Note
- Currently, local and S3 event log paths are supported.
- The event log path must use the format `/local/path/to/eventlog` for local logs or `s3://my-bucket/path/to/eventlog` for logs stored in S3.
- The specified path can also be a directory. In such cases, the tool will recursively search for event logs within the directory.
   - For example: `/path/to/clusterlogs`
- To specify multiple event logs, separate the paths with commas.
   - For example: `s3://my-bucket/path/to/eventlog1,s3://my-bucket/path/to/eventlog2`
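
The path rules above can be checked before running the tool. The following is a minimal sketch, not part of the qualification tool: `split_eventlog_paths` is a hypothetical helper, and it assumes that local paths are absolute and that `s3://` is the only supported URI scheme.

```python
from urllib.parse import urlparse


def split_eventlog_paths(eventlog_path: str) -> list:
    """Split a comma-separated path string and classify each entry.

    Hypothetical helper for illustration; not part of the qualification tool.
    """
    classified = []
    for raw in eventlog_path.split(","):
        path = raw.strip()
        if not path:
            continue  # ignore empty entries such as a trailing comma
        scheme = urlparse(path).scheme
        if scheme == "s3":
            classified.append((path, "s3"))
        elif scheme == "" and path.startswith("/"):
            # Assumption: local paths are given as absolute paths
            classified.append((path, "local"))
        else:
            raise ValueError(f"Unsupported event log path: {path}")
    return classified


print(split_eventlog_paths(
    "s3://my-bucket/path/to/eventlog1,/local/path/to/eventlog"))
```

Running a check like this on `EVENTLOG_PATH` before the tool starts can surface a typo in a path earlier than the tool's own error log would.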


## User Input

In [None]:
# Path to the event log in S3 (or local path)
EVENTLOG_PATH = "s3://my-bucket/path/to/eventlog"  # or "/local/path/to/eventlog"

# S3 path with write access where the output will be copied. 
S3_OUTPUT_PATH = "s3://my-bucket/path/to/output"

## Setup Environment

In [None]:
from IPython.display import display, Markdown

TOOLS_VER = "24.08.2"
display(Markdown(f"**Using Spark RAPIDS Tools Version:** {TOOLS_VER}"))

In [None]:
%pip install spark-rapids-user-tools==$TOOLS_VER --user > /dev/null 2>&1

In [None]:
import os
import pandas as pd

# Update PATH to include local binaries
os.environ['PATH'] += os.pathsep + os.path.expanduser("~/.local/bin")

OUTPUT_PATH = "/tmp"
DEST_FOLDER_NAME = "qual-tool-result"

# Set environment variables
os.environ["EVENTLOG_PATH"] = EVENTLOG_PATH 
os.environ["OUTPUT_PATH"] = OUTPUT_PATH

CONSOLE_OUTPUT_PATH = os.path.join(OUTPUT_PATH, 'console_output.log')
CONSOLE_ERROR_PATH = os.path.join(OUTPUT_PATH, 'console_error.log')

os.environ['CONSOLE_OUTPUT_PATH'] = CONSOLE_OUTPUT_PATH
os.environ['CONSOLE_ERROR_PATH'] = CONSOLE_ERROR_PATH

print(f'Console output will be stored at {CONSOLE_OUTPUT_PATH} and errors will be stored at {CONSOLE_ERROR_PATH}')


## Run Qualification Tool

In [None]:
%%sh
spark_rapids qualification --platform emr --eventlogs "$EVENTLOG_PATH" -o "$OUTPUT_PATH" --verbose > "$CONSOLE_OUTPUT_PATH" 2> "$CONSOLE_ERROR_PATH"

## Console Output
Console output shows the top candidates and their estimated GPU speedup.


In [None]:
%%sh
cat "$CONSOLE_OUTPUT_PATH"

In [None]:
%%sh
cat "$CONSOLE_ERROR_PATH"

In [None]:
import re
import shutil
import os


def extract_file_info(console_output_path, output_base_path):
    try:
        with open(console_output_path, 'r') as file:
            stdout_text = file.read()

        # Extract log file location
        location_match = re.search(r"Location: (.+)", stdout_text)
        if not location_match:
            raise ValueError(
                "Log file location not found in the provided text.")

        log_file_location = location_match.group(1)

        # Extract qualification output folder
        qual_match = re.search(r"qual_[^/]+(?=\.log)", log_file_location)
        if not qual_match:
            raise ValueError(
                "Output folder not found in the log file location.")

        output_folder_name = qual_match.group(0)
        output_folder = os.path.join(output_base_path, output_folder_name)
        return output_folder, log_file_location

    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise RuntimeError(f"Cannot parse console output. Reason: {e}") from e


def copy_logs(destination_folder, *log_files):
    try:
        log_folder = os.path.join(destination_folder, "logs")
        os.makedirs(log_folder, exist_ok=True)

        for log_file in log_files:
            if os.path.exists(log_file):
                shutil.copy2(log_file, log_folder)
            else:
                print(f"Log file not found: {log_file}")
    except Exception as e:
        # Chain the original exception so the root cause stays in the traceback
        raise RuntimeError(f"Cannot copy logs to output. Reason: {e}") from e


# Define both folders up front so later cells can report a failure
# instead of raising NameError when parsing the console output fails.
output_folder = None
jar_output_folder = None

try:
    output_folder, log_file_location = extract_file_info(
        CONSOLE_OUTPUT_PATH, OUTPUT_PATH)
    jar_output_folder = os.path.join(output_folder,
                                     "rapids_4_spark_qualification_output")
    print(f"Output folder detected: {output_folder}")
    copy_logs(output_folder, log_file_location, CONSOLE_OUTPUT_PATH,
              CONSOLE_ERROR_PATH)
    print(f"Logs successfully copied to {output_folder}")
except Exception as e:
    print(e)

## Download Output

In [None]:
import shutil
import os
import subprocess
from IPython.display import HTML, display
from urllib.parse import urlparse

def display_error_message(error_message, exception):
    error_message_html = f"""
    <div style="color: red; margin: 20px;">
        <strong>Error:</strong> {error_message}.
        <br/>
        <strong>Exception:</strong> {exception}
    </div>
    """
    display(HTML(error_message_html))

def copy_file_to_s3(local_file: str, bucket: str, destination_folder_name: str):
    try:
        file_name = os.path.basename(local_file)
        s3_path = f"s3://{bucket}/{destination_folder_name}/{file_name}"
        subprocess.run(["aws", "s3", "cp", local_file, s3_path], check=True, capture_output=True, text=True)
        return construct_download_url(file_name, bucket, destination_folder_name)
    except subprocess.CalledProcessError as e:
        raise Exception(f"Error copying file to S3: {e.stderr}") from e

def get_default_aws_region():
    try:
        return subprocess.check_output(
            "aws configure list | grep region | awk '{print $2}'",
            shell=True,
            text=True
        ).strip()
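    except subprocess.CalledProcessError as e:
        # Raise instead of returning an error string that would be embedded in the URL
        raise RuntimeError("Unable to retrieve the AWS region.") from e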

def construct_download_url(file_name: str, bucket_name: str, destination_folder_name: str):
    region = get_default_aws_region()
    return f"https://{region}.console.aws.amazon.com/s3/object/{bucket_name}?region={region}&prefix={destination_folder_name}/{file_name}"

def create_download_link(source_folder, bucket_name, destination_folder_name):
    folder_to_compress = os.path.join("/tmp", os.path.basename(source_folder))
    local_zip_file_path = shutil.make_archive(folder_to_compress, 'zip', source_folder)
    download_url = copy_file_to_s3(local_zip_file_path, bucket_name, destination_folder_name)

    download_button_html = f"""
    <style>
        .download-btn {{
            display: inline-block;
            padding: 10px 20px;
            font-size: 16px;
            color: white;
            background-color: #4CAF50;
            text-align: center;
            text-decoration: none;
            border-radius: 5px;
            border: none;
            cursor: pointer;
            margin: 15px auto;
        }}
        .download-btn:hover {{
            background-color: #45a049;
        }}
        .button-container {{
            display: flex;
            justify-content: center;
            align-items: center;
        }}
        .button-container a {{
            color: white !important;
        }}
    </style>

    <div style="color: #444; font-size: 14px; text-align: center; margin: 10px;">
        Zipped output file created at {download_url}
    </div>
    <div class='button-container'>
        <a href='{download_url}' class='download-btn'>Download Output</a>
    </div>
    """
    display(HTML(download_button_html))

try:
    parsed_s3_output_path = urlparse(S3_OUTPUT_PATH)
    bucket_name = parsed_s3_output_path.netloc
    # Build the S3 key prefix: <path from S3_OUTPUT_PATH>/<DEST_FOLDER_NAME>
    destination_path = os.path.join(parsed_s3_output_path.path.strip("/"), DEST_FOLDER_NAME.strip("/"))
    create_download_link(output_folder, bucket_name, destination_path)

except Exception as e:
    error_msg = f"Failed to create download link for {output_folder}"
    display_error_message(error_msg, e)


## Summary

The report provides an overview of each application's execution and its estimated GPU speedup, including unsupported operators and non-SQL operations. By default, applications and queries are sorted in descending order by the following fields:

- Estimated GPU Speedup Category
- Estimated GPU Speedup
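
The two-level sort described above can be sketched in plain Python. The rows below are made up, and the category labels and their order in `CATEGORY_ORDER` are assumptions; adjust them to the values that appear in your own `qualification_summary.csv`.

```python
# Assumed category labels, ordered from best to worst (not confirmed by this notebook).
CATEGORY_ORDER = ["Large", "Medium", "Small", "Not Recommended", "Not Applicable"]

# Made-up rows for illustration only.
apps = [
    {"App ID": "app-1", "Estimated GPU Speedup Category": "Small", "Estimated GPU Speedup": 1.4},
    {"App ID": "app-2", "Estimated GPU Speedup Category": "Large", "Estimated GPU Speedup": 2.1},
    {"App ID": "app-3", "Estimated GPU Speedup Category": "Large", "Estimated GPU Speedup": 3.0},
]


def rank_key(app):
    """Sort key: best category first, then highest speedup first."""
    category = app["Estimated GPU Speedup Category"]
    # A lower rank means a better category; unknown labels sort last.
    if category in CATEGORY_ORDER:
        rank = CATEGORY_ORDER.index(category)
    else:
        rank = len(CATEGORY_ORDER)
    # Negate the speedup so larger values come first within a category.
    return (rank, -app["Estimated GPU Speedup"])


ranked = sorted(apps, key=rank_key)
print([app["App ID"] for app in ranked])
```

With these sample rows, both "Large" applications come before the "Small" one, and within "Large" the application with the higher speedup is listed first.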

In [None]:
def millis_to_human_readable(millis):
    seconds = int(millis) / 1000
    if seconds < 60:
        return f"{seconds:.2f} sec"
    else:
        minutes = seconds / 60
        if minutes < 60:
            return f"{minutes:.2f} min"
        else:
            hours = minutes / 60
            return f"{hours:.2f} hr"

try: 
    # Read qualification summary 
    summary_output = pd.read_csv(os.path.join(output_folder, "qualification_summary.csv"))
    summary_output = summary_output.drop(columns=["Unnamed: 0"]).rename_axis('Index').reset_index()
    summary_output['Estimated GPU Duration'] = summary_output['Estimated GPU Duration'].apply(millis_to_human_readable)
    summary_output['App Duration'] = summary_output['App Duration'].apply(millis_to_human_readable)
    
    summary_output = summary_output[[
        'App Name', 'App ID', 'Estimated GPU Speedup Category', 'Estimated GPU Speedup', 
        'Estimated GPU Duration', 'App Duration'
    ]]
    
    # Read cluster information
    cluster_df = pd.read_json(os.path.join(output_folder, "app_metadata.json"))
    cluster_df['Recommended GPU Cluster'] = cluster_df['clusterInfo'].apply(
        lambda x: f"{x['recommendedCluster']['numWorkerNodes']} x {x['recommendedCluster']['workerNodeType']}"
    )
    cluster_df['App ID'] = cluster_df['appId']
    cluster_df = cluster_df[['App ID', 'Recommended GPU Cluster']]
    
    # Merge the results
    results = pd.merge(summary_output, cluster_df, on='App ID', how='left')
    display(results)
except Exception as e:
    error_msg = "Unable to show summary"
    display_error_message(error_msg, e)


## Application Status

The report shows the status of each event log file that was provided.


In [None]:
try:
    status_output = pd.read_csv(
        os.path.join(jar_output_folder,
                     "rapids_4_spark_qualification_output_status.csv"))

    # Set options to display the full content of the DataFrame
    pd.set_option('display.max_rows', None)  # Show all rows
    pd.set_option('display.max_columns', None)  # Show all columns
    pd.set_option('display.width', None)  # Let pandas auto-detect the display width
    pd.set_option('display.max_colwidth', None)  # Display full content of each column

    display(status_output)
except Exception as e:
    error_msg = "Unable to show Application Status"
    display_error_message(error_msg, e)

## Stages Output

For each stage used in SQL operations, the Qualification tool generates the following information:

1. App ID
2. Stage ID
3. Average Speedup Factor: The average estimated speed-up of all the operators in the given stage.
4. Stage Task Duration: The amount of time spent in tasks of SQL DataFrame operations for the given stage.
5. Unsupported Task Duration: The sum of task durations for the unsupported operators. For more details, see [Supported Operators](https://nvidia.github.io/spark-rapids/docs/supported_ops.html).
6. Stage Estimated: Indicates if the stage duration had to be estimated (True or False).
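
As an illustration of how these fields relate, the sketch below computes the share of each stage's task time that is spent in unsupported operators. The CSV content is made up, the header names are assumed to match the field names listed above, and `unsupported_share` is a hypothetical helper.

```python
import csv
import io

# Made-up CSV content; header names are assumed to match the field list above.
SAMPLE_STAGES_CSV = """App ID,Stage ID,Average Speedup Factor,Stage Task Duration,Unsupported Task Duration,Stage Estimated
app-1,0,2.5,4000,1000,False
app-1,1,1.0,0,0,True
"""


def unsupported_share(row):
    """Return the fraction of stage task time spent in unsupported operators."""
    total = float(row["Stage Task Duration"])
    unsupported = float(row["Unsupported Task Duration"])
    # Guard against stages that report no task time.
    return unsupported / total if total > 0 else 0.0


rows = list(csv.DictReader(io.StringIO(SAMPLE_STAGES_CSV)))
shares = {int(row["Stage ID"]): unsupported_share(row) for row in rows}
print(shares)
```

A stage with a high share is likely to benefit less from the GPU, whatever its average speedup factor. The same ratio can be computed on the rows of the stages CSV that this notebook loads.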


In [None]:
try:
    stages_output = pd.read_csv(
        os.path.join(jar_output_folder,
                     "rapids_4_spark_qualification_output_stages.csv"))
    display(stages_output)
except Exception as e:
    error_msg = "Unable to show stage output"
    display_error_message(error_msg, e) 

## Execs Output

The Qualification tool generates a report of the “Exec” in the “SparkPlan” or “Executor Nodes” along with the estimated acceleration on the GPU. Please refer to the [Supported Operators guide](https://nvidia.github.io/spark-rapids/docs/supported_ops.html) for more details on limitations on UDFs and unsupported operators.

1. App ID
2. SQL ID
3. Exec Name: For example, Filter or HashAggregate.
4. Expression Name
5. Task Speedup Factor: The average acceleration of the operators based on the original CPU duration of the operator divided by the GPU duration. The tool uses historical queries and benchmarks to estimate a speed-up at an individual operator level to calculate how much a specific operator would accelerate on GPU.
6. Exec Duration: Wall-clock time measured from when the operator starts until it is completed.
7. SQL Node ID
8. Exec Is Supported: Indicates whether the Exec is supported by RAPIDS. Refer to the Supported Operators guide linked above for details.
9. Exec Stages: An array of stage IDs.
10. Exec Children
11. Exec Children Node IDs
12. Exec Should Remove: Indicates whether the operator is removed from the migrated plan.
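
One way to use these fields is to total the time spent in unsupported Execs, grouped by Exec name. The sketch below uses made-up CSV content with a subset of the columns. It assumes the header names match the field names listed above and that `Exec Is Supported` is written as `true` or `false`; `unsupported_duration_by_exec` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up CSV content with a subset of the columns; header names are assumed.
SAMPLE_EXECS_CSV = """App ID,SQL ID,Exec Name,Exec Duration,Exec Is Supported
app-1,0,Filter,1200,true
app-1,0,HashAggregate,300,false
app-1,1,HashAggregate,500,false
"""


def unsupported_duration_by_exec(csv_text):
    """Sum Exec Duration per Exec Name for rows not marked as supported."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Compare case-insensitively in case the flag is written as "True" or "true".
        if row["Exec Is Supported"].strip().lower() != "true":
            totals[row["Exec Name"]] += int(row["Exec Duration"])
    return dict(totals)


print(unsupported_duration_by_exec(SAMPLE_EXECS_CSV))
```

Sorting the resulting totals by duration gives a rough list of the operators to look up first in the Supported Operators guide.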


In [None]:
try:
    execs_output = pd.read_csv(
        os.path.join(jar_output_folder,
                     "rapids_4_spark_qualification_output_execs.csv"))
    display(execs_output)
except Exception as e:
    error_msg = "Unable to show Execs output"
    display_error_message(error_msg, e) 