
## The "SGEMM GPU kernel performance" dataset

The dataset is available at https://huggingface.co/datasets/inria-soda/tabular-benchmark/tree/main/reg_cat.

It is also hosted at https://huggingface.co/datasets/anastasiafrosted/gpu_anastasia, which is the repository loaded in this notebook.


## Imports

In [None]:
# Install seaborn (used for the pairplot) and the Hugging Face datasets library.
!pip install -q seaborn
!pip install -q datasets

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

## Get the data

In [3]:
from datasets import load_dataset
dataset = load_dataset("anastasiafrosted/gpu_anastasia", download_mode="force_redownload")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

Train dataset size: 193280
Test dataset size: 48320


In [6]:
raw_df = pd.DataFrame(dataset['train'])
raw_columns = list(dataset['train'].features.keys())
raw_df

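| | Unnamed: 0 | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | KWI | VWM | VWN | STRM | STRN | SA | SB | Run1 (ms) | Run2 (ms) | Run3 (ms) | Run4 (ms) | avg_runs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135475 | 64 | 128 | 32 | 16 | 16 | 8 | 16 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 39.74 | 39.73 | 39.71 | 39.73 | 39.7275 |
| 1 | 30555 | 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | 2 | 2 | 1 | 0 | 1 | 1 | 37.04 | 36.89 | 36.24 | 36.82 | 36.7475 |
| 2 | 37111 | 32 | 64 | 16 | 16 | 16 | 16 | 16 | 8 | 1 | 2 | 0 | 1 | 1 | 1 | 32.81 | 32.89 | 32.63 | 32.75 | 32.7700 |
| 3 | 223611 | 128 | 128 | 32 | 8 | 16 | 32 | 8 | 2 | 2 | 8 | 1 | 0 | 1 | 1 | 919.07 | 886.25 | 886.06 | 886.48 | 894.4650 |
| 4 | 123017 | 64 | 128 | 16 | 32 | 16 | 32 | 32 | 8 | 1 | 4 | 1 | 0 | 0 | 1 | 30.78 | 30.76 | 30.77 | 30.78 | 30.7725 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193275 | 59158 | 32 | 128 | 32 | 8 | 32 | 8 | 16 | 2 | 2 | 4 | 0 | 1 | 1 | 0 | 22.24 | 21.95 | 22.15 | 23.00 | 22.3350 |
| 193276 | 150875 | 128 | 32 | 16 | 8 | 8 | 16 | 16 | 2 | 8 | 2 | 1 | 0 | 1 | 1 | 161.07 | 161.26 | 161.10 | 161.25 | 161.1700 |
| 193277 | 127702 | 64 | 128 | 32 | 8 | 16 | 16 | 8 | 2 | 1 | 8 | 0 | 1 | 1 | 0 | 178.02 | 176.55 | 175.63 | 176.74 | 176.7350 |
| 193278 | 464 | 16 | 16 | 16 | 8 | 16 | 16 | 16 | 8 | 1 | 1 | 0 | 0 | 0 | 0 | 135.54 | 135.57 | 135.30 | 135.33 | 135.4350 |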


## Clean the data

The `"Unnamed: 0"` column is just a row id with no relevance to our goals, so we drop it.

The other features are described below (descriptions from https://archive.ics.uci.edu/dataset/440/sgemm+gpu+kernel+performance):

- 1-2. MWG, NWG: per-matrix 2D tiling at workgroup level: {16, 32, 64, 128} (integer);
- 3.  KWG: inner dimension of 2D tiling at workgroup level: {16, 32} (integer);
- 4-5. MDIMC, NDIMC: local workgroup size: {8, 16, 32} (integer);
- 6-7. MDIMA, NDIMB: local memory shape: {8, 16, 32} (integer);
- 8. KWI: kernel loop unrolling factor: {2, 8} (integer);
- 9-10. VWM, VWN: per-matrix vector widths for loading and storing: {1, 2, 4, 8} (integer);
- 11-12. STRM, STRN: enable stride for accessing off-chip memory within a single thread: {0, 1} (categorical);
- 13-14. SA, SB: per-matrix manual caching of the 2D workgroup tile: {0, 1} (categorical).

The columns `Run1 (ms)`, `Run2 (ms)`, `Run3 (ms)` and `Run4 (ms)` record the running time in milliseconds of 4 independent runs with the same parameters. The values range from 13.25 ms to 3397.08 ms.
The last column, `avg_runs`, is the average of the 4 runs.
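
As a quick sanity check, the averaging can be reproduced on a few rows. This is a minimal sketch: the small `sample` frame is typed in by hand from the training table shown earlier, purely for illustration. On the full data the same comparison would presumably be run on `raw_df`.

```python
import numpy as np
import pandas as pd

# Three rows copied by hand from the training table shown earlier.
sample = pd.DataFrame({
    "Run1 (ms)": [39.74, 37.04, 919.07],
    "Run2 (ms)": [39.73, 36.89, 886.25],
    "Run3 (ms)": [39.71, 36.24, 886.06],
    "Run4 (ms)": [39.73, 36.82, 886.48],
    "avg_runs": [39.7275, 36.7475, 894.4650],
})

run_columns = ["Run1 (ms)", "Run2 (ms)", "Run3 (ms)", "Run4 (ms)"]

# Recompute the row-wise mean of the four runs.
recomputed = sample[run_columns].mean(axis=1)

# True when every recomputed mean matches the stored avg_runs value.
matches = bool(np.allclose(recomputed, sample["avg_runs"]))
print(matches)
```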


No further encoding is needed: the categorical features (STRM, STRN, SA, SB) are binary and already stored as 0/1 values.
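
One way to confirm that no extra encoding is required is to check that the four categorical columns contain only 0 and 1. The sketch below does this on five rows typed in by hand from the training table shown earlier. On the real data the same check would presumably be applied to `raw_df`.

```python
import pandas as pd

# Rows copied by hand from the training table shown earlier.
sample = pd.DataFrame({
    "STRM": [0, 1, 0, 1, 1],
    "STRN": [0, 0, 1, 0, 0],
    "SA":   [1, 1, 1, 1, 0],
    "SB":   [1, 1, 1, 1, 1],
})

categorical_columns = ["STRM", "STRN", "SA", "SB"]

# A column needs no further encoding if it only holds the values 0 and 1.
is_binary = sample[categorical_columns].isin([0, 1]).all()
print(is_binary)
```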


#### Split the data into training and test sets

The dataset was uploaded to the Hugging Face Hub already split into a training set and a test set (80%/20%), so only the id column needs to be dropped from each split.

In [9]:
train_dataset = raw_df.drop(columns=['Unnamed: 0'])
test_dataset = pd.DataFrame(dataset['test']).drop(columns=['Unnamed: 0'])
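
The 80%/20% proportion can be checked with a little arithmetic on the split sizes printed when the dataset was loaded:

```python
# Split sizes as printed by the loading cell.
train_size = 193280
test_size = 48320

total = train_size + test_size
train_fraction = train_size / total
test_fraction = test_size / total

print(f"train: {train_fraction:.0%}, test: {test_fraction:.0%}")
```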

#### Inspect the data

Drop the four per-run columns, keeping only their average `avg_runs`, and review the first few rows of the training set.

In [10]:
# Keep only the averaged runtime; the four per-run columns are dropped.
run_columns = ['Run1 (ms)', 'Run2 (ms)', 'Run3 (ms)', 'Run4 (ms)']
train_df = train_dataset.drop(columns=run_columns)
test_df = test_dataset.drop(columns=run_columns)
column_names = train_df.columns
train_df.head()

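| | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | KWI | VWM | VWN | STRM | STRN | SA | SB | avg_runs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 128 | 32 | 16 | 16 | 8 | 16 | 8 | 2 | 2 | 0 | 0 | 1 | 1 | 39.7275 |
| 1 | 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | 2 | 2 | 1 | 0 | 1 | 1 | 36.7475 |
| 2 | 32 | 64 | 16 | 16 | 16 | 16 | 16 | 8 | 1 | 2 | 0 | 1 | 1 | 1 | 32.77 |
| 3 | 128 | 128 | 32 | 8 | 16 | 32 | 8 | 2 | 2 | 8 | 1 | 0 | 1 | 1 | 894.465 |
| 4 | 64 | 128 | 16 | 32 | 16 | 32 | 32 | 8 | 1 | 4 | 1 | 0 | 0 | 1 | 30.7725 |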


In [None]:
sns.pairplot(train_df[column_names], diag_kind='kde')

Let's also check the overall statistics. Note how each feature covers a very different range:

In [11]:
train_df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MWG | 193280.0 | 80.453146 | 42.471526 | 16.0 | 32.0 | 64.0 | 128.0 | 128.0 |
| NWG | 193280.0 | 80.369619 | 42.465333 | 16.0 | 32.0 | 64.0 | 128.0 | 128.0 |
| KWG | 193280.0 | 25.514818 | 7.855294 | 16.0 | 16.0 | 32.0 | 32.0 | 32.0 |
| MDIMC | 193280.0 | 13.93572 | 7.871337 | 8.0 | 8.0 | 8.0 | 16.0 | 32.0 |
| NDIMC | 193280.0 | 13.955877 | 7.883382 | 8.0 | 8.0 | 8.0 | 16.0 | 32.0 |
| MDIMA | 193280.0 | 17.375207 | 9.395169 | 8.0 | 8.0 | 16.0 | 32.0 | 32.0 |
| NDIMB | 193280.0 | 17.374834 | 9.387766 | 8.0 | 8.0 | 16.0 | 32.0 | 32.0 |
| KWI | 193280.0 | 4.997734 | 3.000007 | 2.0 | 2.0 | 2.0 | 8.0 | 8.0 |
| VWM | 193280.0 | 2.449364 | 1.954978 | 1.0 | 1.0 | 2.0 | 4.0 | 8.0 |
| VWN | 193280.0 | 2.447429 | 1.953661 | 1.0 | 1.0 | 2.0 | 4.0 | 8.0 |


## Split features from labels


Separate the target value, the "label", from the features. Here the label is `avg_runs`, the value the model will be trained to predict.

In [12]:
train_features = train_df.copy()
test_features = test_df.copy()

train_labels = train_features.pop('avg_runs')
test_labels = test_features.pop('avg_runs')

In [13]:
train_features

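| | MWG | NWG | KWG | MDIMC | NDIMC | MDIMA | NDIMB | KWI | VWM | VWN | STRM | STRN | SA | SB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 128 | 32 | 16 | 16 | 8 | 16 | 8 | 2 | 2 | 0 | 0 | 1 | 1 |
| 1 | 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | 2 | 2 | 1 | 0 | 1 | 1 |
| 2 | 32 | 64 | 16 | 16 | 16 | 16 | 16 | 8 | 1 | 2 | 0 | 1 | 1 | 1 |
| 3 | 128 | 128 | 32 | 8 | 16 | 32 | 8 | 2 | 2 | 8 | 1 | 0 | 1 | 1 |
| 4 | 64 | 128 | 16 | 32 | 16 | 32 | 32 | 8 | 1 | 4 | 1 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193275 | 32 | 128 | 32 | 8 | 32 | 8 | 16 | 2 | 2 | 4 | 0 | 1 | 1 | 0 |
| 193276 | 128 | 32 | 16 | 8 | 8 | 16 | 16 | 2 | 8 | 2 | 1 | 0 | 1 | 1 |
| 193277 | 64 | 128 | 32 | 8 | 16 | 16 | 8 | 2 | 1 | 8 | 0 | 1 | 1 | 0 |
| 193278 | 16 | 16 | 16 | 8 | 16 | 16 | 16 | 8 | 1 | 1 | 0 | 0 | 0 | 0 |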


## Normalization

In the table of statistics it's easy to see how different the ranges of each feature are:

In [None]:
train_df.describe().transpose()[['mean', 'std']]

| | mean | std |
|---|---|---|
| MWG | 80.453146 | 42.471526 |
| NWG | 80.369619 | 42.465333 |
| KWG | 25.514818 | 7.855294 |
| MDIMC | 13.93572 | 7.871337 |
| NDIMC | 13.955877 | 7.883382 |
| MDIMA | 17.375207 | 9.395169 |
| NDIMB | 17.374834 | 9.387766 |
| KWI | 4.997734 | 3.000007 |
| VWM | 2.449364 | 1.954978 |
| VWN | 2.447429 | 1.953661 |


It is good practice to normalize features that use different scales and ranges.

One reason this matters is that the features are multiplied by the model weights, so the scale of the inputs affects both the scale of the outputs and the scale of the gradients.

Although a model *might* converge without feature normalization, normalization makes training much more stable.
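
To make the point about scale concrete, here is a small NumPy sketch. It uses a hypothetical one-weight linear model and made-up targets (neither comes from this notebook). It compares the gradient of the mean squared error when the input is on the scale of `MWG` with the gradient when the input is on the scale of `VWM`.

```python
import numpy as np

def mse_gradient(x, y, w):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return float(np.mean(2.0 * (w * x - y) * x))

# Value sets taken from the feature descriptions above.
mwg = np.array([16.0, 32.0, 64.0, 128.0])  # large-scale feature
vwm = np.array([1.0, 2.0, 4.0, 8.0])       # small-scale feature

# Hypothetical targets, chosen only for illustration.
y = np.array([1.0, 2.0, 3.0, 4.0])

w = 0.5  # same starting weight for both features
grad_mwg = mse_gradient(mwg, y, w)
grad_vwm = mse_gradient(vwm, y, w)

print(grad_mwg, grad_vwm)
```

With the same weight and targets, the gradient for the large-scale feature is far larger in magnitude, which is why a single learning rate tends to suit normalized inputs better.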

Note: There is no advantage to normalizing the binary (0/1) features; it is done here for simplicity. For more details on how to use the preprocessing layers, refer to the [Working with preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide and the [Classify structured data using Keras preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) tutorial.

**TODO: the code snippet that scales the dataset and writes out the two CSV files goes here. I don't remember where I wrote it, sorry :)**

In [14]:
# RobustScaler is a class in scikit-learn, not a standalone module.
from sklearn.preprocessing import RobustScaler
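
A minimal sketch of what the missing scaling step might look like, assuming scikit-learn's `RobustScaler` is the intended scaler. The tiny frames and the output file names (`train_scaled.csv`, `test_scaled.csv`) are placeholders for illustration, not the original code or data.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Tiny stand-ins for train_features / test_features (hypothetical values).
train_features = pd.DataFrame({"MWG": [16, 32, 64, 128], "KWI": [2, 2, 8, 8]})
test_features = pd.DataFrame({"MWG": [32, 128], "KWI": [8, 2]})

# Fit on the training split only, then reuse the same statistics for the test split.
scaler = RobustScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train_features), columns=train_features.columns
)
test_scaled = pd.DataFrame(
    scaler.transform(test_features), columns=test_features.columns
)

# Write the two scaled splits to CSV (directory and file names are placeholders).
output_dir = Path(tempfile.mkdtemp())
train_path = output_dir / "train_scaled.csv"
test_path = output_dir / "test_scaled.csv"
train_scaled.to_csv(train_path, index=False)
test_scaled.to_csv(test_path, index=False)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the preprocessing. By default `RobustScaler` centers each column on its median and divides by its interquartile range.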

### The Normalization layer

The `tf.keras.layers.Normalization` layer is a clean and simple way to add feature normalization to your model.

The first step is to create the layer:

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling `Normalization.adapt`:

In [None]:
normalizer.adapt(np.array(train_features))

`adapt` calculates the mean and variance of each feature and stores them in the layer. Print the stored means:

In [None]:
print(normalizer.mean.numpy())

[[80.453 80.37  25.515 13.936 13.956 17.375 17.375  4.998  2.449  2.447
   0.5    0.5    0.5    0.5  ]]
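
According to the Keras documentation, the layer then transforms its input as `(input - mean) / sqrt(var)`. The NumPy sketch below mimics that computation on a small sample of three columns (`MWG`, `KWI`, `SA`) typed in by hand from the training table. It illustrates the formula and does not call the layer itself.

```python
import numpy as np

# Five rows of (MWG, KWI, SA) copied by hand from the training table.
sample = np.array([
    [64.0, 8.0, 1.0],
    [32.0, 8.0, 1.0],
    [32.0, 8.0, 1.0],
    [128.0, 2.0, 1.0],
    [64.0, 8.0, 0.0],
])

# Per-feature statistics, the kind of quantities adapt() stores in the layer.
mean = sample.mean(axis=0)
variance = sample.var(axis=0)

# Standardize: subtract the mean and divide by the standard deviation.
normalized = (sample - mean) / np.sqrt(variance)

print(normalized.mean(axis=0).round(3))
print(normalized.std(axis=0).round(3))
```

After the transformation every column has mean 0 and standard deviation 1, regardless of its original range.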
