This project aims to analyze hardware performance metrics related to tasks executed on FPGA (Field-Programmable Gate Array) systems using High-Level Synthesis (HLS).
Python 3.9.6

To run this project, you need Python installed along with the following libraries: numpy, pandas, seaborn, matplotlib, scikit-learn. Run the script using:

python3 main.py
The following imports are needed:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

You can install these libraries through pip:

pip install numpy pandas seaborn matplotlib scikit-learn
The dataset used in this project was contributed by @georgewzg95 and is loaded from a CSV file. The dataframe contains 1340 entries covering different FPGA tasks, with various hardware performance measurements. Columns include:
clock_speed: The speed at which the task operates. Higher clock speeds usually indicate faster processing times.
alm (Adaptive Logic Modules): The number of adaptive logic modules used by the task. ALMs are the basic building blocks in FPGAs that implement logic functions.
reg (Registers): The number of registers used by the task. Registers are small storage locations within a processor or FPGA that hold data temporarily.
dsp (Digital Signal Processing Units): Specialized hardware units designed to efficiently perform complex mathematical computations, particularly for digital signal processing tasks.
ram (Random Access Memory): A type of computer memory that can be accessed randomly; it is used to store data temporarily while the task is running.
mlab (Memory Lab Usage): Specialized memory blocks within FPGAs used for various memory-related operations.
The purpose of pd.read_csv('data_intel.csv') is to load the CSV file at the specified path; ensure the file path is correct and accessible from the script's location. 'data_intel.csv' is the CSV dataset used in the project.
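As a hedged sketch of the load-and-inspect step — the real data_intel.csv is project-specific, so a small synthetic stand-in with the columns described above (plus an assumed task-name column and made-up values) is used here:

```python
import io
import pandas as pd

# Synthetic stand-in for data_intel.csv; the task-name column and the
# example values are assumptions for illustration only.
csv_text = """name,clock_speed,alm,reg,dsp,ram,mlab
atax_1,210.5,1200,2400,4,8,2
bicg_5,305.0,150,300,0,1,0
gemm_ncubed_6,180.0,9000,18000,64,120,30
"""
HLS_data = pd.read_csv(io.StringIO(csv_text))

# Basic inspection before any modeling: shape, dtypes, and first rows.
print(HLS_data.shape)
print(HLS_data.dtypes)
print(HLS_data.head())
```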
HLS_data = pd.read_csv('data_intel.csv')

Here are some APIs and variables used to analyze the dataset:
X: Feature variables used as input for the first regression model to predict alm.
y: Target variable for the first regression model, representing alm.
X1: Feature variables used as input for the second regression model to predict clock_speed.
y1: Target variable for the second regression model, representing clock_speed.
mse: Mean Squared Error, used to evaluate the performance of the regression models.
r2: R-squared, used to evaluate the goodness of fit of the regression models.
coef and coef1: Coefficients of the linear regression models.
scaled_HLS_features: Standardized features from the dataset used for clustering.
inertia: A measure of how internally coherent the clusters are in K-means clustering.
clusters: Cluster assignments for each task generated by the K-means algorithm.
tasks_by_cluster: A collection of task names grouped by their respective clusters.
The code is organized into four sections, with three ML models created.
- Load and inspect the dataset.
- Visualize relationships between various hardware metrics and 'clock_speed'/'alm'.
- Explore feature distributions using histograms.
vars = HLS_data.columns[3:]
figure, axes = plt.subplots(len(vars), 3, figsize=(15, 30))
sns.set(font_scale=0.8)
figure.subplots_adjust(hspace=0.8, wspace=0.6)
for i, var in enumerate(vars):
    # Column 0: scatter of each metric against alm (skip alm vs itself).
    if var != 'alm':
        sns.scatterplot(x=var, y='alm', data=HLS_data, ax=axes[i, 0], alpha=0.4).set_title(f'{var} vs alm', fontsize=7, weight='bold')
    # Column 1: scatter of each metric against clock_speed (skip clock_speed vs itself).
    if var != 'clock_speed':
        sns.scatterplot(x=var, y='clock_speed', data=HLS_data, ax=axes[i, 1], alpha=0.4).set_title(f'{var} vs clock_speed', fontsize=7, weight='bold')
    # Column 2: distribution of the metric itself.
    sns.histplot(x=var, data=HLS_data, ax=axes[i, 2]).set_title('Distribution', fontsize=7, weight='bold')
plt.show()

Note: the earlier version used `else: continue`, which skipped the remaining plots for the alm and clock_speed rows; the `if` checks above only skip the self-comparison scatter plot.
zvs = HLS_data.select_dtypes(include=[np.number])

- Compute and display correlation coefficients to identify relationships between key features, supporting timing/operation-delay prediction and power estimation.
- Develop and evaluate linear regression models to predict alm (Adaptive Logic Modules) and clock_speed based on other hardware metrics.
- Assess model performance using metrics such as Mean Squared Error (MSE) and R-squared (R²).
- Analyze model coefficients.
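A minimal sketch of the correlation step, using synthetic stand-in data (so the values differ from the project's actual results below):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: alm is constructed to track reg, loosely mimicking
# the strong alm-reg relationship reported in this project.
rng = np.random.default_rng(0)
reg = rng.uniform(100, 10000, 200)
HLS_data = pd.DataFrame({
    'clock_speed': rng.uniform(100, 400, 200),
    'reg': reg,
    'alm': 0.5 * reg + rng.normal(0, 50, 200),
})

# Keep only numeric columns so corr() is well defined, then compute
# pairwise Pearson correlation coefficients.
zvs = HLS_data.select_dtypes(include=[np.number])
corr = zvs.corr()
print(corr.round(3))
```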
ALM Correlation:
- Strong positive correlation with reg (0.981) and mlab (0.909)
- Moderate positive correlation with dsp (0.703)
- Weak negative correlation with clock_speed (-0.362)
Clock Speed Correlation:
- Weak negative correlation with alm (-0.362) and mlab (-0.414)
- Weak correlation with other metrics
ALM Prediction: Model Performance: R² = 0.988, MSE = 24,238,366.47
- Positive influences: reg (0.404), mlab (8.980)
- Negative influences: clock_speed (-23.433), dsp (-96.263), ram (-7.495)
Clock Speed Prediction: Model Performance: R² = 0.382, MSE = 5,969.85
- Positive influences: dsp (0.378), mlab (0.048)
- Negative or near-zero influences: alm (-0.008), reg (0.003), ram (-0.097)
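The regression workflow can be sketched as follows; the data here is synthetic, so the fitted coefficients, MSE, and R² will differ from the results above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the numeric HLS metrics (column names assumed).
rng = np.random.default_rng(1)
n = 300
clock_speed = rng.uniform(100, 400, n)
reg = rng.uniform(100, 10000, n)
mlab = rng.uniform(0, 50, n)
alm = 0.4 * reg + 9.0 * mlab - 20.0 * clock_speed + rng.normal(0, 100, n)
HLS_data = pd.DataFrame({'clock_speed': clock_speed, 'reg': reg,
                         'mlab': mlab, 'alm': alm})

# X holds the predictors, y the target (alm), matching the variable
# naming used in this project.
X = HLS_data[['clock_speed', 'reg', 'mlab']]
y = HLS_data['alm']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
mse = mean_squared_error(y_test, pred)   # average squared error
r2 = r2_score(y_test, pred)              # goodness of fit
coef = model.coef_                       # one coefficient per predictor
print(f'MSE={mse:.2f}, R2={r2:.3f}')
print(dict(zip(X.columns, coef.round(3))))
```

The second model (predicting clock_speed) follows the same pattern with X1/y1 in place of X/y.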
- Create three clusters and categorize them using various measures.
- Understand and describe the characteristics of each cluster, providing insights into resource-intensive tasks, performance-optimized tasks, and more.
- Cluster 0: Tasks with moderate resource usage and lower clock speeds; possibly less intensive but still needing considerable resources. Example tasks: atax_1, bicg_14, k2mm_9, bfs_bulk_43, gemm_blocked_39
- Cluster 1: High resource usage across all metrics, likely representing the most demanding tasks that need powerful hardware. Example tasks: gemm_ncubed_6, spmv_crs_27, stencil2D_36
- Cluster 2: High clock speeds but minimal resource consumption, indicating highly efficient tasks that are performance-oriented rather than resource-intensive. Example tasks: atax_3, bicg_5, k3mm_4, syrk_12, bfs_queue_1
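A hedged sketch of the clustering step, using synthetic data with three artificial groups (the real cluster contents and inertia will differ):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in: 90 tasks drawn around three centers in
# (clock_speed, alm) space; task names are placeholders.
rng = np.random.default_rng(2)
centers = np.array([[200.0, 2000.0], [150.0, 9000.0], [350.0, 300.0]])
feats = np.vstack([c + rng.normal(0, [10.0, 200.0], (30, 2)) for c in centers])
HLS_data = pd.DataFrame({'name': [f'task_{i}' for i in range(90)],
                         'clock_speed': feats[:, 0], 'alm': feats[:, 1]})

# Standardize so both metrics contribute on a comparable scale.
scaled_HLS_features = StandardScaler().fit_transform(
    HLS_data[['clock_speed', 'alm']])

# Fit K-means with three clusters, as in this project.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_HLS_features)
inertia = kmeans.inertia_  # within-cluster sum of squared distances

# Group task names by assigned cluster (cf. tasks_by_cluster).
HLS_data['cluster'] = clusters
tasks_by_cluster = HLS_data.groupby('cluster')['name'].apply(list)
print(f'inertia={inertia:.2f}')
print(tasks_by_cluster.apply(len))
```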