# ImageNet Classification
>Performance analysis for ImageNet Classification on all hardware platforms

- toc: true 
- badges: true
- comments: true
- categories: [ImageNet,Rooflines,Performance Prediction]
- image: images/imagenet_logo.png

In [1]:
#hide
import pandas as pd
import numpy as np
import altair as alt

import utils
import scripts

#from scripts.utils import replace_data_df

W = 600
H = 480
pd.options.display.float_format = '{:20,.6f}'.format
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000

csv_path = "./data/cleaned_csv/backup.csv"

In [2]:
#hide
# master_df.loc[(master_df.NN_Topology =='RN50') & (master_df.PruningFactor != '100%')]

# Theoretical Analysis of ImageNet

### Rooflines for All Hardware Platforms and CNNs

Combining application requirements with hardware platform characteristics can be leveraged for performance predictions using UCB’s roofline models. Using assumptions for where weights, activation tensors, and state of a neural network are stored, combined with the size of the datatypes used, allow us to derive the arithmetic intensity of a neural network during inference. Combined with the roofline for a given hardware platform, we can provide insight as to whether a neural network will be memory or compute bound and guidance for what is theoretically possible in regards to its throughput.

In [3]:
#hide_inp-ut
# Run the Rooflines script which processes the data and creates the chart
%run scripts/altair_plots.py
rooflines('imagenet')

### Performance Prediction

The following heatmap shows the theoretical performance for the listed hardware platforms for ImageNet classification. The metric used for the theoretical performance is input/second.
Looking at the plot, it becomes clear that prunning along with quantization outputs some of the best performance results.

In [4]:
#hide_input
%run scripts/altair_plots.py
heatmap('data/performance_predictions_imagenet_mnist_cifar.csv', 'imagenet', 'Performance Prediction for ImageNet')

# Experimental Data Analysis

### Overview of All Measurements for ImageNet

In this table, within the rows, we show the type of hardware platforms that we used for this task (for example FPGA or GPU) and then more specifically the exact name of the different hardware platforms. For each hardware platform, we list the sweep of specific deployment parameters (batch sizes, operating modes etc) that were used for the experimentation in separate columns. In the columns, we show CNN topologies. When a CNN topology was implemented on a given hardware platform, we show in the corresponding cell the precisions (quantization information) and the channel pruning scale. Otherwise, “na” indicates that the topology wasn’t executed on this specific hardware platform. Many combinations between topology and hardware platform are not supported by the vendors dedicated software environments. INTx depicts a fixed point integer representation with x bits. FPy represents a floating point representation with y bits, for example FP32 is singe precision floating point. Table follows below.

In [5]:
#hide
print(pd.read_csv('data/overview_experiments_imagenet.csv').to_markdown())

|    | Hardware   | Platform         | ResNet50                  | GoogLeNetV1   | MobileNet   | Batch/Stream/Thread                  |
|---:|:-----------|:-----------------|:--------------------------|:--------------|:------------|:-------------------------------------|
|  0 | FPGA       | ZCU102-DPU       | [INT8]*[100%,80%,50%,30%] | INT8          | na          | [1,2,3,4,5,6,7,8]                    |
|  1 | FPGA       | ZCU104-DPU       | INT8                      | INT8          | na          | [1,2,3,4,5,6,7,8]                    |
|  2 | FPGA       | Ultra96-DPU      | [INT8]*[100%,80%,50%,30%] | INT8          | INT8        | [1,2,3,4,5,6,7,8]                    |
|  3 | FPGA       | ZCU104-FINN      | na                        | na            | na          | [1,2,4,8,16,32,64,128,256,512,10000] |
|  4 | FPGA       | ZCU104-BISMO     | na                        | na            | na          | [2,4,8,16,32,64,128]                 |
|  5 | GPU        | TX2-maxn         | FP16,FP32

In [6]:
#hide_input
%run scripts/script_tables.py 
#get table with the experiments overview
dataframes = csv_to_dataframe_multiindex(['data/overview_experiments_imagenet_.csv'])
for dataframe in dataframes:   
       display(HTML(dataframe.to_html(index=False)))

Unnamed: 0_level_0,Unnamed: 1_level_0,ImageNet Classification,ImageNet Classification,ImageNet Classification,Unnamed: 5_level_0
Hardware,Platform,ResNet50,GoogLeNetV1,MobileNet,Batch/Stream/Thread
FPGA,ZCU102-DPU,"[INT8]*[100%,80%,50%,30%]",INT8,na,"[1,2,3,4,5,6,7,8]"
,ZCU104-DPU,INT8,INT8,na,"[1,2,3,4,5,6,7,8]"
,Ultra96-DPU,"[INT8]*[100%,80%,50%,30%]",INT8,INT8,"[1,2,3,4,5,6,7,8]"
,ZCU104-FINN,na,na,na,"[1,2,4,8,16,32,64,128,256,512,10000]"
,ZCU104-BISMO,na,na,na,"[2,4,8,16,32,64,128]"
GPU,TX2-maxn,"FP16,FP32","FP16,FP32",na,"[1,2,4,8,16,32,64,128]"
,TX2-maxp,"FP16,FP32","FP16,FP32",na,"[1,2,4,8,16,32,64,128]"
,TX2-maxq,"FP16,FP32","FP16,FP32",na,"[1,2,4,8,16,32,64,128]"
TPU,TPU-fast clk,na,INT8,INT8,[1]
,TPU-slow clk,na,INT8,INT8,[1]


In [12]:
#hide
master_df = pd.read_csv(csv_path)
#fix ResNet50 Pruning values from 100,50,25,12.5 to -> 100,80,50,30
is_maxp = lambda row: row.HWType != "TX2" or row["Op mode"].split(",")[0] == "maxp" or row["Op mode"] == "fast" or row["Op mode"] == "slow"
maxp_df = master_df[master_df.apply(is_maxp, axis=1)]
imagenet_df = maxp_df[maxp_df.NN_Topology.isin(['GoogLeNetv1','ResNet-50','MobileNetv1','ResNet-50v15']) & maxp_df['lat-comp'].notna()]
bad_precisions = ["FP"+str(i) for i in range(17,24)]
#this version has the values for ResNet50 v1.5
imagenet_df.Datatype = imagenet_df.Datatype.apply(lambda x: 'FP16' if x in bad_precisions else x)
imagenet_df["hw_datatype_prun_net"] = imagenet_df.apply(lambda r: "_".join([r.HWType, r.Datatype, r.PruningFactor, r.NN_Topology]), axis=1)

imagenet_df["PruningFactor"] = imagenet_df["PruningFactor"].str.strip("%").astype(float)
norm_by_group(imagenet_df, "lat-comp", "NN_Topology");
imagenet_df["datatype_model"] = imagenet_df.Datatype + '_' + imagenet_df.HWType
imagenet_df.rename(columns={"top1 [%]": "top1"}, inplace=True)
imagenet_df["tag"] = imagenet_df.apply(lambda r: "_".join([r.HWType, r.Datatype, r.NN_Topology, str(r.PruningFactor)]), axis=1)

#filling GOPS data gaps 
imagenet_df['GOPS'] = imagenet_df.apply(lambda r: 1.14 if r.NN_Topology == 'MobileNetv1' else 
                                          ( 3.13 if r.NN_Topology == 'GoogLeNetv1' else 
                                           ( 8.2 if r.NN_Topology == 'ResNet-50v15' else 
                                            ( 7.72 if r.NN_Topology == 'ResNet-50' and r.PruningFactor == 100 else 
                                             ( 6.54 if r.NN_Topology == 'ResNet-50' and r.PruningFactor == 80 else 
                                              ( 3.75 if r.NN_Topology == 'ResNet-50' and r.PruningFactor == 50 else 
                                               ( 2.45 if r.NN_Topology == 'ResNet-50' and r.PruningFactor == 30 else 0 )))))) , axis=1)

#fill in tp-system and tp-cmp
imagenet_df['tp-system'] = imagenet_df['fps-system'] * imagenet_df['GOPS']
imagenet_df['tp-comp'] = imagenet_df['fps-comp'] * imagenet_df['GOPS']
imagenet_df['GOPS'] = imagenet_df['GOPS'] * imagenet_df['batch/thread/stream']
imagenet_df.head(300)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = val

Unnamed: 0,NN_Topology,HWType,Datatype,Op mode,batch/thread/stream,lat-sys,lat-comp,fps-system,fps-comp,tp-system,tp-comp,top1,top5 [%],Base_Pwr_W,Idle_Pwr_W,Full_Pwr_W,GOPS,PruningFactor,level,hw_datatype_prun_net,norm-lat-comp,datatype_model,tag
0,ResNet-50v15,EdgeTPU,INT8,fast,1,,40.44151,10.552,24.727,86.5264,202.7614,,,,,1.19,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.954757,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
1,ResNet-50v15,EdgeTPU,INT8,fast,1,,40.58504,10.589,24.64,86.8298,202.048,,,,,1.49,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.958145,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
2,ResNet-50v15,EdgeTPU,INT8,slow,1,,42.35792,10.075,23.608,82.615,193.5856,,,,,0.9623,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,1.0,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
3,ResNet-50v15,EdgeTPU,INT8,slow,1,,41.69559,7.111,23.983,58.3102,196.6606,,,,,1.02,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.984363,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
273,MobileNetv1,EdgeTPU,INT8,slow,1,7.86,4.08249,127.256,244.949,145.07184,279.24186,69.5674,87.7058,0.253,0.253,0.462,1.14,100.0,l3,EdgeTPU_INT8_100%_MobileNetv1,1.0,INT8_EdgeTPU,EdgeTPU_INT8_MobileNetv1_100.0
274,MobileNetv1,EdgeTPU,INT8,fast,1,6.0,2.57047,166.533,389.034,189.84762,443.49876,69.5674,87.7058,0.253,0.253,0.532,1.14,100.0,l3,EdgeTPU_INT8_100%_MobileNetv1,0.629633,INT8_EdgeTPU,EdgeTPU_INT8_MobileNetv1_100.0
403,GoogLeNetv1,EdgeTPU,INT8,slow,1,10.03,5.72131,99.741,174.785,312.18933,547.07705,69.2434,88.4458,0.253,0.253,0.463,3.13,100.0,l3,EdgeTPU_INT8_100%_GoogLeNetv1,0.006099,INT8_EdgeTPU,EdgeTPU_INT8_GoogLeNetv1_100.0
404,GoogLeNetv1,EdgeTPU,INT8,fast,1,7.4,3.64852,135.087,274.084,422.82231,857.88292,69.2434,88.4458,0.253,0.253,0.538,3.13,100.0,l3,EdgeTPU_INT8_100%_GoogLeNetv1,0.00389,INT8_EdgeTPU,EdgeTPU_INT8_GoogLeNetv1_100.0
437,GoogLeNetv1,TX2,FP16,maxp,1,9.93,6.16337,99.8245,169.338,312.450685,530.02794,66.928,87.832,1.8,4.7,8.07,3.13,100.0,l3,TX2_FP16_100%_GoogLeNetv1,0.006571,FP16_TX2,TX2_FP16_GoogLeNetv1_100.0
438,GoogLeNetv1,TX2,FP16,maxp,2,17.06,10.6197,108.36,192.363,339.1668,602.09619,66.928,87.832,1.8,4.7,8.28,6.26,100.0,l3,TX2_FP16_100%_GoogLeNetv1,0.011321,FP16_TX2,TX2_FP16_GoogLeNetv1_100.0


### Line Plot

In [13]:
#hide_input
dataframe = imagenet_df
sel = alt.selection_multi(fields=["hw_datatype_prun_net"], bind="legend")
fig25_dot = alt.Chart(dataframe).mark_point().encode(
    x='lat-comp',
    y=alt.Y('fps-comp', scale=alt.Scale(type="log")),
    color=select_color(sel, 'hw_datatype_prun_net:N'),
    tooltip=['Op mode', 'fps-comp', 'lat-comp', 'HWType', 'batch/thread/stream'],
)
fig25_line = alt.Chart(dataframe).mark_line().encode(
    x='lat-comp',
    y='fps-comp',
    color=select_color(sel, 'hw_datatype_prun_net:N'),
    tooltip=['Op mode', 'fps-comp', 'lat-comp', 'HWType', 'batch/thread/stream'],
)

fig = (fig25_dot+fig25_line).properties(
    title="Latency Versus Performance for Pruned and Quantized ImageNet Classifier Variants",
    width=W,
    height=H,
).add_selection(sel).interactive()

fig

### Boxplots

In [14]:
#hide_input 
#%run scripts/altair_plots.py  #run the plot script if it wasn't previously run
boxplot(df=imagenet_df, xaxis='PruningFactor', yaxis="lat-comp", color_col= 'PruningFactor', facet_column='datatype_model' , title="Latency by Hardware/Framework and Pruning for ImageNet Classification")

In [15]:
#hide_input
#%run scripts/altair_plots.py  #run the plot script if it wasn't previously run
boxplot(df=imagenet_df, xaxis='PruningFactor', yaxis="fps-comp", color_col= 'PruningFactor', facet_column='datatype_model' , title="Throughput by Hardware/Framework and Pruning for ImageNet Classification")

In [16]:
#hide_input
#%run scripts/altair_plots.py  #run the plot script if it wasn't previously run
boxplot(df=imagenet_df, xaxis='PruningFactor', yaxis="Full_Pwr_W", color_col= 'PruningFactor', facet_column='datatype_model' , title="Power Consumption by Hardware/Framework and Pruning for ImageNet Classification")

## Pareto Graphs

The following pareto graph presents the accuracy versus performance in fps for all the Hardware Platforms across different Pruning and Quantization configurations. This provides insights into accuracy-based comparisons.

In [18]:
df_pareto_graph

Unnamed: 0,NN_Topology,HWType,Datatype,Op mode,batch/thread/stream,lat-sys,lat-comp,fps-system,fps-comp,tp-system,tp-comp,top1,top5 [%],Base_Pwr_W,Idle_Pwr_W,Full_Pwr_W,GOPS,PruningFactor,level,hw_datatype_prun_net,norm-lat-comp,datatype_model,tag
0,ResNet-50v15,EdgeTPU,INT8,fast,1,,40.44151,10.552,24.727,86.5264,202.7614,,,,,1.19,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.954757,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
1,ResNet-50v15,EdgeTPU,INT8,fast,1,,40.58504,10.589,24.64,86.8298,202.048,,,,,1.49,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.958145,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
2,ResNet-50v15,EdgeTPU,INT8,slow,1,,42.35792,10.075,23.608,82.615,193.5856,,,,,0.9623,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,1.0,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
3,ResNet-50v15,EdgeTPU,INT8,slow,1,,41.69559,7.111,23.983,58.3102,196.6606,,,,,1.02,8.2,100.0,l3,EdgeTPU_INT8_100%_ResNet-50v15,0.984363,INT8_EdgeTPU,EdgeTPU_INT8_ResNet-50v15_100.0
273,MobileNetv1,EdgeTPU,INT8,slow,1,7.86,4.08249,127.256,244.949,145.07184,279.24186,69.5674,87.7058,0.253,0.253,0.462,1.14,100.0,l3,EdgeTPU_INT8_100%_MobileNetv1,1.0,INT8_EdgeTPU,EdgeTPU_INT8_MobileNetv1_100.0
274,MobileNetv1,EdgeTPU,INT8,fast,1,6.0,2.57047,166.533,389.034,189.84762,443.49876,69.5674,87.7058,0.253,0.253,0.532,1.14,100.0,l3,EdgeTPU_INT8_100%_MobileNetv1,0.629633,INT8_EdgeTPU,EdgeTPU_INT8_MobileNetv1_100.0
403,GoogLeNetv1,EdgeTPU,INT8,slow,1,10.03,5.72131,99.741,174.785,312.18933,547.07705,69.2434,88.4458,0.253,0.253,0.463,3.13,100.0,l3,EdgeTPU_INT8_100%_GoogLeNetv1,0.006099,INT8_EdgeTPU,EdgeTPU_INT8_GoogLeNetv1_100.0
404,GoogLeNetv1,EdgeTPU,INT8,fast,1,7.4,3.64852,135.087,274.084,422.82231,857.88292,69.2434,88.4458,0.253,0.253,0.538,3.13,100.0,l3,EdgeTPU_INT8_100%_GoogLeNetv1,0.00389,INT8_EdgeTPU,EdgeTPU_INT8_GoogLeNetv1_100.0
437,GoogLeNetv1,TX2,FP16,maxp,1,9.93,6.16337,99.8245,169.338,312.450685,530.02794,66.928,87.832,1.8,4.7,8.07,3.13,100.0,l3,TX2_FP16_100%_GoogLeNetv1,0.006571,FP16_TX2,TX2_FP16_GoogLeNetv1_100.0
438,GoogLeNetv1,TX2,FP16,maxp,2,17.06,10.6197,108.36,192.363,339.1668,602.09619,66.928,87.832,1.8,4.7,8.28,6.26,100.0,l3,TX2_FP16_100%_GoogLeNetv1,0.011321,FP16_TX2,TX2_FP16_GoogLeNetv1_100.0


In [17]:
#hide_input
#%run scripts/altair_plots.py  #run the plot script if it wasn't previously run
# ResNet50 v15 does not have accuracy measurements yet, so it needs to be taken out
df_pareto_graph = imagenet_df[imagenet_df.NN_Topology != 'RN50V15']
pareto_graph(df= df_pareto_graph, 
             groupcol= 'tag', 
             xcol= 'fps-comp', 
             ycol= 'top1', 
             W= W, 
             H= H, 
             title= "ImageNet Cassification Design Space: Accuracy Versus Performance")


IndexError: index 0 is out of bounds for axis 0 with size 0

In [11]:
#hide 
%run scripts/overlapped_pareto.py

df = process_measured_data(csv_filepath= 'data/cleaned_csv/experimental_data_mnist.csv')
df

SyntaxError: invalid syntax (overlapped_pareto.py, line 627)

NameError: name 'process_measured_data' is not defined

# Theoretical Pareto and Measured Pareto Overlapped

In order to easily understand how accurate predictions were, an overlapping between the Theoretical Pareto Plot and Measured Pareto Plot was made. In the plot below we have both theoretical (orange) and measured (blue) pareto lines. All measured datapoins are represented as crosses and all theoretical datatpoins are represented as circles. Some theoretical datapoints don't have a measured matched datapoint and the same goes for the measured datapoints. The theoretical pareto curve is, as expected, on the right of the measured one, as predictions are sometimes different form measurements.

In [None]:
#-------------------------------------------------
%run scripts/altair_plots.py
def process_theo_top1(csv_theor_accuracies: str) -> pd.DataFrame():
    """
    Method that gets the CNNs and their accuracies table from Theoretical Analysis and melts it into 2 columns 
   
    Parameters
    ----------
    csv_theor_accuracies:str
        Filepath to the CNNs and their accuracy table. 
    
    Returns
    -------
    df_top1_theo: pd.DataFrame()
        Datraframe with 2 columns: |top1 | net_prun_datatype|
        
    """
    # GET THEORETICAL values
    #  get the table above
    df_top1_theo = pd.read_csv(csv_theor_accuracies)
    #  melt it into 2 columns: 
    df_top1_theo = melt_df(df_in= df_top1_theo, cnn_names_col= ' ', new_column_names=['net_prun','datatype','top1'])
    #fix small stuff like deleting rows, merging columns...
    df_top1_theo = fix_small_stuff_df(df= df_top1_theo, col_to_drop=['index','datatype','net_prun'] )
    #  now we have: top1 | net_prun_datatype 
    return df_top1_theo

#------------------------------------------------------

def process_theo_fps(df_top1_theo:pd.DataFrame(),csv_files: list) -> pd.DataFrame():
    """
    Method that gets the data from the csv of the Heatmap tables.
    Merges this theoretical df with the given theoretical df (fps+top1) on the 'net_prun_datatype' common column.
    Removes nans from the 'values' column. Changes column order and columns names.
    Replaces things to match.
    
    Notes: Values on the shared column need to be equal for them to be included on the merge. 
            Eg.: 'MLP_100%_INT2' has to match with 'MLP_100%_INT2' otherwise what comes from the performance precitions will be ignored
 
    Parameters
    ----------
    csv_theor_accuracies:str
        Filepath to the CNNs and their accuracy table. 
    
    Returns
    -------
    df_top1_theo: pd.DataFrame()
        Datraframe with 2 columns: |top1 | net_prun_datatype|
        
    """
    
    df_fps_theo = pd.DataFrame()
    for csv_file in csv_files:
        df_tmp = pd.read_csv(csv_file)
        df_fps_theo = pd.concat([df_fps_theo, df_tmp])
    df_fps_theo['x']= df_fps_theo['x'].str.replace('-','_')
    #    remove rows that have 'nan' in the 'values' column
    df_fps_theo = df_fps_theo[df_fps_theo['values'].notna()]
    #    rename columns
    df_fps_theo.columns=['hardw','net_prun_datatype','fps']

    #   Merge both Theoretical dataframes: fps + top1 
    df_fps_top1_theo = pd.merge(df_top1_theo, df_fps_theo, on='net_prun_datatype', how='outer')
    #  change column order
    df_fps_top1_theo = df_fps_top1_theo[['net_prun_datatype', 'hardw', 'top1', 'fps']]
    #  change column names
    df_fps_top1_theo.columns = ['net_prun_datatype', 'hardw_datatype', 'top1', 'fps-comp']
    
    #Notes: 1. make sure everything in 'net_prun_datatype' column has network + pruning + datatype. If not it will fail
    df_fps_top1_theo = replace_data_df(df_=df_fps_top1_theo, column= 'net_prun_datatype', list_tuples_data_to_replace= [('GoogLeNetv1','GoogLeNetv1_100%'),('MobileNetv1','MobileNetv1_100%'),('GoogleNetv1','GoogleNetv1_100%'), ('EfficientNet_S','EfficientNet-S_100%'), ('EfficientNet_M','EfficientNet-M_100%'), ('EfficientNet_L','EfficientNet-L_100%'), ('%','')])
    #  now that we have: net_prun_datatype | hardw_datatype | top1 | fps-comp
    return df_fps_top1_theo

#----------------------------------------------------------------------

def melt_df(df_in: pd.DataFrame(), cnn_names_col: str, new_column_names: list)->pd.DataFrame():
    """Melts a dataframe into 2 columns, the 'cnn_names_col' and the 'value' column. 
    
    Parameters
    ----------
    df_in : pd.DataFrame()
        Dataframe which will be melted.
    cnn_names_col: str
        Column/s which will not be selected to be melted. Eg.:First column ' '.
        
    new_column_names: str
        New column names to give to the dataframe. 
    
    Returns
    -------
    df_out: pd.DataFrame()
        Returns the melted dataframe with the specified column names.
        
        
    """
    df_out = pd.DataFrame()
    #  select all columns except first
    columns = (df_in.loc[:, df_in.columns!=cnn_names_col]).columns 
    for column in columns:
        # melt df1 to have only 2 columns
        df_tmp = pd.melt(df_in, id_vars=[cnn_names_col], value_vars=column) 
        df_out = pd.concat([df_out,df_tmp])
    # setting new columns names
    df_out.columns = new_column_names 
    return df_out

#-----------------------------------------------------

def spot_no_match(list_: list) -> list:
    """
    Method that creates a list of hexadecimal colors. Colors depend on wheteher there is a substring inside each
    list_ item. For 'no match' the color is black, else, the color is created randomly 
   
    Parameters
    ----------
    list_ : list
        List of strings.  
    
    Returns
    -------
    list_of_colors: list
        List with the same size as the input list. Each item is a hexadecimal color. 
               
    """
    sub='no_match'
    list_of_colors=[]
    for index, word in enumerate(list_):
        #if there is no match then appned the black color
        if sub in word:
            list_of_colors.append('#000000')
        else:
            # create random color
            color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])]
            list_of_colors.append(color[0])
    return list_of_colors
#-----------------------------------------------------

def get_point_chart_enhanced(df: pd.DataFrame, color_groupcol: str,  shape_groupcol: str,  
                    xcol: str,  ycol: str,  shapes: str, title: str, legend_title_groupcol: str)->alt.vegalite.v4.api.Chart: 
    
    """
    Creates an elaborated point chart with the following configurations:
        -different colors
        -different shapes
        -black color to datapoints that don't have a match (theoretical-measured)
        -x axis log scale
        -Text on plot
        -Tooltips
   
    Parameters
    ----------
    df : pd.DataFrame
        
    color_groupcol: str
        Column name which will be what distinguishes colors. 
    shape_groupcol: str
        Column name which will be what distinguishes shapes.
    xcol: str
        Column name which will be the x axis.
    ycol: str
        Column name which will be the y axis.
    shapes: str
        Desired shape range.
    title: str
        Plot title.
    
    legend_title_groupcol:
        Title of the Legend.
    Returns
    -------
    Vega chart: alt.vegalite.v4.api.Chart
        List with the same size as the input list. Each item is a hexadecimal color. 
               
    """
    domain = df[color_groupcol].unique().tolist()
    range_= spot_no_match(list_= domain)
    points= alt.Chart(df).mark_point(size=100, opacity=1, filled =True).properties(
            width= W,
            height= 1.3*H,
            title=title
        ).encode(
            x= alt.X(xcol,  scale=alt.Scale(type="log")),
            y=alt.Y(ycol + ":Q", scale=alt.Scale(zero=False)),
            color=alt.Color(color_groupcol, scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),
            #tooltip=["HWType", "Precision", "PruningFactor", "batch/thread/stream", ycol, xcol],
            shape=alt.Shape(shape_groupcol, scale=alt.Scale(range=shapes), legend=alt.Legend(title = 'Datapoint Type')),
            tooltip=['hardw_datatype_net_prun',color_groupcol, shape_groupcol, xcol, ycol],

        )
    text = points.mark_text(
        angle=325,
        align='left',
        baseline='middle',
        dx=7
    ).encode(
        text='hardw'
    )
    return (points + text).interactive()
#----------------------------------------------------

def get_line_chart(df: pd.DataFrame, groupcol: str, xcol: str, ycol:str, color:str) ->alt.vegalite.v4.api.Chart:
    """
    Creates simple line chart. With tooltips and log scale on the x axis
   
    Parameters
    ----------
     df: pd.DataFrame()
        Contains the data for the chart.
     groupcol: str
        Column name which can be what distinguishes colors. Not used atm. 
     xcol: str
        Column name which will be the x axis.
     ycol: str
        Column name which will be the y axis.
     color: str
        Line color.
    Returns
    -------
    chart: alt.vegalite.v4.api.Chart
         Returns a simple altair line chart based on the inputs.
        
    """
    return alt.Chart(df).interactive().mark_line(point=True).encode(
        x=alt.X(xcol, scale=alt.Scale(type='log')),
        y=alt.Y(ycol + ":Q", scale=alt.Scale(zero=False)),
        color=alt.value(color),
        tooltip=[groupcol, xcol, ycol],
    )
#---------------------------------------------------

def get_pareto_df(df: pd.DataFrame(), groupcol: str, xcol: str, ycol: str) -> pd.DataFrame():
    """
    Creates a pareto dataframe. This method doesn't take into account when the lines go up instead of being constant. 
   
    Parameters
    ----------
     df: pd.DataFrame()
        Dataframe from which the pareto dataframe will be created. 
     groupcol: str
         Column name which will be used to determine the groups for the groupby.
     xcol: str
          Column name which will be used for the x axis later. Used in te groupby.    
     ycol: str
         Column name which will be the used for the groupby to create the y axis.
    Returns
    -------
    pareto_line_df: pd.DataFrame()
        Dataframe with the pareto information.
        
    """
    pareto_line_df = df.groupby(groupcol)[xcol].max().to_frame("x")
    pareto_line_df['y'] = df.groupby(groupcol)[ycol].agg(lambda x: x.value_counts().index[0])
    pareto_line_df.sort_values('y', ascending=False, inplace=True)
    pareto_line_df['x'] = pareto_line_df.x.cummax()
    pareto_line_df.drop_duplicates('x', keep='first', inplace=True)
    pareto_line_df['group'] = pareto_line_df.index
    return pareto_line_df

#-----------------------------------------------------


def get_several_paretos_df(list_df: pd.DataFrame, groupcol: str, xcol: str, ycol:str, colors: list)->pd.DataFrame():
    """Method that:
        -Receives several dataframes as input inside a list. For each one of them:
            -Gets the pareto dataframe;
            -Creates a line chart from the above mentiioned pareto dataframe;
        -creates a df with all charts inside a column;
   
    Parameters
    ----------
     list_df: pd.DataFrame()
        Contains all dataframes from which the line charts will be generated and put inside the output dataframe (df_out_charts).
     groupcol: str
         Column name which will be used to determine the groups for the groupby.
     xcol:str
         Column name which will be used for the x axis later. Used in te groupby.
     ycol: str
         Column name which will be the used for the groupby to create the y axis.
     colors:list
         List with the colors for each line plot, for each dataframe inside the input list_df.
         
    Returns
    -------
    df_out_charts: pd.DataFrame()
       Dataframe with all output charts.
        
    """
    df_out_charts = pd.DataFrame(columns=['charts'])
    for i, df in enumerate(list_df):
        pareto_df = get_pareto_df(df= df , groupcol= groupcol, xcol= xcol, ycol= ycol)
        chart = get_line_chart(df= pareto_df, groupcol= 'group', xcol= 'x', ycol= 'y', color = colors[i]) 
        df_out_charts = df_out_charts.append(pd.DataFrame([[chart]], columns=['charts']))
    return df_out_charts

#---------------------------------------------


def process_measured_df(df_theoret: pd.DataFrame(), csv_measured: str )-> pd.DataFrame():
    """ Method that gets the measured dataframe from the csv file and fixes small stuff inside it, concatenates with the theoretical df.
   
    Parameters
    ----------
     df_theoret: pd.DataFrame()
        Datafrmae which will be concatenated with the measured df.
        
     csv_measured: str
         Path to the csv file in which small stuff will be fixed inside it. 
    Returns
    -------
    df_out: pd.DataFrame()
       Processed dataframe which is the combination of theoretical with measured.
        
    """
    #   get the measured dataframe
    df_measured = pd.read_csv(csv_measured)
    #   fix samll stuff in the measured dataframe so things match
    df_measured = replace_data_df(df_=df_measured, column='hardw_datatype_net_prun', list_tuples_data_to_replace=[("RN50", "ResNet50"),("MNv1", "MobileNetv1"),('GNv1','GoogLeNetv1'),('100.0','100')])
    df_measured = replace_data_df(df_=df_measured, column='network', list_tuples_data_to_replace=[("RN50", "ResNet50"),("MNv1", "MobileNetv1"),('GNv1','GoogLeNetv1')])
    #  concatenate both measured with theoretical
    df_out = pd.concat([df_theoret, df_measured])
    
    return df_out
#-------------------------------------------------

def select_cnn_match_theo_for_measured(df_theo: pd.DataFrame(), net_prun_datatype: str) -> pd.DataFrame():
    """
    Method that processes the dataframe to make it look like the measured dataframe so they can be matched together later.
    Eliminates all NaNs and replaces elements to make dfs look alike. 
   
    Parameters
    ----------
    df_theo: pd.DataFrame()
        Dataframe with the data upon which these alterations will be done
    net_prun_datatype: str
        Column name which should have the network, pruning factor and datatype.
    Returns
    -------
    df_theo: pd.DataFrame()
        Processed df. 
        
    """
    # create a subset from the given dataframe
    #     there is another way to do this 
    #df_theo = df_superset[df_superset.apply(lambda row: row[net_prun_datatype].split('_')[0] == cnn_keyword, axis=1)]
    #    the line below is not needed because there is only 1 classification
    #df_theo = df_superset.loc[df_superset[net_prun_datatype].str.contains(cnn_keyword, na=False)]
    df_theo = df_theo[df_theo['top1'].notna()]
    df_theo = df_theo[df_theo['fps-comp'].notna()]
    
    #   given that we have on theoretical df:  net_prun_datatype | hardw_datatype | top1 | fps-comp
    #   and that we have on the measured df:   hardw_datatype_net_prun | batch/thread/stream  | hardw | network | fps-comp | top1 | type
    #We need to:
    #   1. Create 'network', 'type', 'hardware' and 'hardw_datatype_net_prun'
    df_theo['network'] = df_theo[net_prun_datatype].str.split('_').str[0]
    df_theo['type'] = 'predicted'
    #replace elemnts out of hardw column - take datatypes out of hardw_datatype column
    df_theo = replace_data_df(df_=df_theo, column= 'hardw_datatype', list_tuples_data_to_replace=[("-INT2", ""), ("-INT4", ""), ("-INT8", ""), ("-FP16", ""), ("-FP32", "")])      
    # 'hardw_datatype' column only has the hardware now
    df_theo['hardw_datatype_net_prun'] = df_theo['hardw_datatype']+'_'+df_theo[net_prun_datatype].str.split('_').str[2] +'_'+ df_theo['network']+'_'+df_theo[net_prun_datatype].str.split('_').str[1]
        
    #   delet unnecessary columns
    df_theo = df_theo.drop(columns = ['net_prun_datatype'])
    #  change column order
    df_theo= df_theo[['hardw_datatype_net_prun', 'hardw_datatype','network', 'fps-comp', 'top1', 'type']]

    #   rename columns
    df_theo.columns=['hardw_datatype_net_prun','hardw','network', 'fps-comp', 'top1', 'type']
    return df_theo
#-------------------------------------------------------------

def fix_small_stuff_df(df: pd.DataFrame(), col_to_drop: list, ) -> pd.DataFrame():
    """Method that fixes small stuff in a dataframe. Things like:
        -remove rows with 'nm'
        - drop unnecessary columns
        -...
   
    Parameters
    ----------
     df: pd.DataFrame()
        Dataframe which will endure all there alterations.
     col_to_drop: list
         List of columns to be dropped.
    Returns
    -------
    df_out: pd.DataFrame()
       Processed dataframe.
        
    """
    df_out = df.copy()
    #   delete all rows that have 'top1 (top5) [%]' inside
    df_out = df_out[df_out['top1'] !='top1 (top5) [%]']
    #    delete all rows with 'nm'
    df_out = df_out[df_out.top1!='nm'] 
    df_out = df_out.reset_index()
    #   merge 'net_prun' with 'datatype' column into 'net_prun_datatype'
    df_out['net_prun_datatype'] = df_out.net_prun + ' ' + df_out.datatype
    df_out = df_out.drop(columns = col_to_drop)

    #    Some cells have [top1 (top5)] accuracies, create col only with top1 acc
    df_out['top1'] = df_out['top1'].str.split(' ').str[0] #take top5 acc out
    #    separate by underscore instead of space
    df_out = replace_data_df(df_=df_out, column='net_prun_datatype', list_tuples_data_to_replace=[(' ','_')])
    return df_out

#--------------------------------------------------------
def process_measured_data(csv_filepath:str)->pd.DataFrame():
    """ This is to create a df to be joined with the theoretical df in 'Theoretical Analysis' to create the overlapped paretos

    Steps
    ------
    1. Create subset from imagenet that doesn't have the ResNet50 v15 measurements because it does not have accuracy measures
    2. Create new hardware column that has hardware and operation mode, beware with NaNs
    3. Create new 'hardw_datatype_net_prun' with hardware + datatype + netwrok + pruning
    4. Create a suset of the dataframe with the above mentioned column and the corresponding ones
    5. With groupby for col 'hardw_datatype_net_prun', for each unique value get the rows with biggest batch 
    6. Add 'type column', reset the index from 'hardw_datatype_net_prun' to ints and save it
    
    Parameters
    ----------
     csv_filepath: str
        Contains  the file path to the file with all measurements which will be read and prepared to be later joined with the theoretical predictions
    
      Returns
    -------
    pd.DataFrame()
        Processed dataframe to match the theoretical predictions dataframe.
    
    """
    df = pd.read_csv(csv_filepath)
    # ResNet50 v15 does not have accuracy measurements yet, so it needs to be taken out
    # create df from imagenet_df
    df = df[df.NN_Topology != 'RN50V15']
    # create hardw column to include: hardware + op_mode
    df['hardw'] = df['HWType'] + ('-' + df['Op mode']).fillna('')
    #create hardw_datatype_net_prun col with all those columns merged
    df['hardw_datatype_net_prun'] = df.apply(lambda r: "_".join([r.hardw, r.Datatype, r.NN_Topology, str(r.PruningFactor)]), axis=1)
    #create a subset of the dataframe with only those columns
    df = df[['hardw_datatype_net_prun','hardw', 'NN_Topology' ,'fps-comp', 'top1','batch/thread/stream']]
    #Only get the points corresponding to the biggest batch
    df = df.groupby('hardw_datatype_net_prun')[['batch/thread/stream','hardw', 'NN_Topology','fps-comp', 'top1']].max()
    #add type column
    df['type'] = 'measured'
    # reset index to start being numeric 
    df = df.reset_index()
    #save it all
    df.to_csv('data/cleaned_csv/pareto_data_imagenet.csv', index = False)
    #   change column names
    df.columns = ['hardw_datatype_net_prun', 'batch/thread/stream', 'hardw', 'network', 'fps-comp', 'top1', 'type']
    #   fix samll stuff in the df so things match with the other side
    df = replace_data_df(df_=df, column='hardw_datatype_net_prun', list_tuples_data_to_replace=[("RN50", "ResNet50"),("MNv1", "MobileNetv1"),('GNv1','GoogLeNetv1'),('100.0','100'),('25.0','25') ,('50.0','50'),('30.0','30'),('80.0','80')])
    df = replace_data_df(df_=df, column='network', list_tuples_data_to_replace=[("RN50", "ResNet50"),("MNv1", "MobileNetv1"),('GNv1','GoogLeNetv1')])
    #delete unnecessary columns
    df = df.drop(columns=['batch/thread/stream'])

    return df
#---------------------------------------------------------------------------------------------------------------
def identify_pairs_nonpairs(df: pd.DataFrame, column: str) -> pd.DataFrame():
    """This method identifies equal values in the column and signals them, and creates another column with labels for each case 

    Parameters
    ----------
     df: pd.DataFrame()
        Dataframe which will be processed.
     column: str
         Column which has: hardware platform, datatype, network and pruning factor. It has duplicated values.
    Returns
    -------
    df: pd.DataFrame()
       Processed dataframe.
        
    """
    # IDENTIFY ALL PAIRS AND CREATE A SPECIAL COLUMN FOR THEM
    #get all pair and then get unique names out of those pairs
    df['pairs'] = df[column].duplicated(keep=False)
    unique_names = df.loc[df.pairs ==True, column].unique()
    #set a color for each one of them
    color = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
             for i in range(len(unique_names))]
    #put it into a dict
    names_with_colors = {key:color for key,color in zip(unique_names,color)}
    #assign it to the dataframe color column. Only fill up rows (with the same color) that have pairs
    df['color'] = df[column].apply(lambda x: x if x in names_with_colors else '')
    #fill up the rest of the rows that do not have a pair
    df['color'] = df.apply(lambda row: 'predicted_no_match' if row.type=='predicted' and row.color=='' else 
                                                     ('measured_no_match' if row.type=='measured' and row.color=='' else (row.color)), axis=1)
    #df = df.drop(columns=['pairs'])
    return df

#-------------------------------------------------------

#hide
def plot_it_now(df: pd.DataFrame, xcol: str, ycol: str, groupcol: str, title: str) -> alt.vegalite.v4.api.Chart:
    """This method creates all plots for the overlapped pareto and layers them all together. 
    These are: 2 pareto lines and the points plot.
    All points plot are binded to checkboxes
   
    Parameters
    ----------
     df: pd.DataFrame()
        Contains data to be plotted.
     xcol: str
         Dataframe column which has the information for the x axis.
     ycol: str
         Dataframe column which has the information for the y axis.
     groupcol: str
          Dataframe column which has the information for the color.
     title:str
         Title to give to the plot.
    
    Returns
    -------
    Layered chart: -> alt.vegalite.v4.api.Chart
       Layered chart, both theoretical pareto curves and the points chart
        
    """
    #get the pareto data to built the pareto lines
    df_theo =df.loc[df.type=='predicted',:]
    df_exper = df.loc[df.type=='measured',:]
    df_charts = get_several_paretos_df(list_df = [df_theo, df_exper], groupcol= groupcol, xcol= xcol , ycol= ycol, colors=['#FFA500', '#0066CC'])
    
    #this is to be used in the color field to set a different color for each field, and to set to black all that doesn't have a match
    domain = df[groupcol].unique().tolist()
    range_= spot_no_match(list_= domain)
    
    #Select data from the dataframe to bind to each checkbox
    measu_no_match_data= df[df[groupcol].str.contains("measured")]
    predic_no_match_data= df[df[groupcol].str.contains("predicted")]
    FINN_data= df.loc[df[groupcol].str.contains("finn")]
    BISMO_data= df.loc[df[groupcol].str.contains("bismo")]
    A53_data= df.loc[df[groupcol].str.contains("a53")]
    TX2_data= df.loc[df[groupcol].str.contains("tx2")]
    NCS_data= df.loc[df[groupcol].str.contains("ncs")]
    TPU_data= df.loc[df[groupcol].str.contains("tpu")]
    
    #The type of binding will be a checkbox
    filter_checkbox = alt.binding_checkbox()
    
    #Create all checkboxes
    #measu_no_match_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="Measured_Without_Match") 
    #predicted_no_match_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="Predicted_Without_Match") 
    FINN_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="ZCU104_FINN") 
    BISMO_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="ZCU104_BISMO")
    A53_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="U96_Quadcore_A53")
    TX2_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="TX2")
    NCS_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="NCS")
    TPU_select = alt.selection_single( fields=["Hide"], bind=filter_checkbox, name="TPU")
    
    legend_title_groupcol ='Hardw_Datatype_Net_Prun'
    #Color Conditions for each plot
    #measu_no_match_cond= alt.condition(measu_no_match_select,  alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_),  legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    #predicted_no_match_cond = alt.condition(predicted_no_match_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)), alt.value(None))
    FINN_cond    = alt.condition(FINN_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    BISMO_cond   = alt.condition(BISMO_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    A53_cond     = alt.condition(A53_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    TX2_cond     = alt.condition(TX2_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    NCS_cond     = alt.condition(NCS_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value(None))
    TPU_cond     = alt.condition(TPU_select, alt.Color(groupcol+':N', scale=alt.Scale(domain=domain, range=range_), legend=alt.Legend(columns=2, title = legend_title_groupcol)),alt.value("white"))
    
    #Create the charts
    #measu_no_match_chart=get_point_chart_selection(df= measu_no_match_data, condition=measu_no_match_cond, selection=measu_no_match_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    #predic_no_match_chart=get_point_chart_selection(df= predic_no_match_data,condition=predicted_no_match_cond, selection=predicted_no_match_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    FINN_chart=get_point_chart_selection(df= FINN_data, condition=FINN_cond, selection=FINN_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    BISMO_chart=get_point_chart_selection(df= BISMO_data,condition=BISMO_cond, selection=BISMO_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    A53_chart=get_point_chart_selection(df= A53_data,condition=A53_cond, selection=A53_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    TX2_chart=get_point_chart_selection(df= TX2_data,condition=TX2_cond, selection=TX2_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    NCS_chart=get_point_chart_selection(df= NCS_data,condition=NCS_cond, selection=NCS_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    TPU_chart=get_point_chart_selection(df= TPU_data,condition=TPU_cond, selection=TPU_select, color_groupcol= 'color', shape_groupcol= 'type',shapes=['cross', 'circle'], xcol= xcol, ycol= ycol, title=title, legend_title_groupcol="Hardw_Datatype_Net_Prun" )
    warnings.filterwarnings("ignore")
    #sum the pareto lines
    chart = df_charts.charts.sum(numeric_only = False)
    #layer the pareto lines with the points chart with checkboxes
    charts = alt.layer(FINN_chart + BISMO_chart + A53_chart+ TX2_chart+ NCS_chart +TPU_chart + chart
    ).resolve_scale(color='independent',shape='independent').properties(title=title)
    return charts

def process_theo_fps(df_top1_theo:pd.DataFrame(), csv_file:str) -> pd.DataFrame():
    """
    Method that gets the data from the csv of the Heatmap tables.
    Merges this theoretical df with the given theoretical df (fps+top1) on the 'net_prun_datatype' common column.
    Removes nans from the 'values' column. Changes column order and columns names.
    Replaces things to match.
    
    Notes: Values on the shared column need to be equal for them to be included on the merge. 
            Eg.: 'MLP_100%_INT2' has to match with 'MLP_100%_INT2' otherwise what comes from the performance precitions will be ignored
 
    Parameters
    ----------
    csv_file:str
        Filepath to the CNNs and their fps 
    
    Returns
    -------
    df_top1_theo: pd.DataFrame()
        Datraframe with 2 columns: |top1 | net_prun_datatype|
        
    """
    
    df_fps_theo = process_csv_for_heatmaps_plot(csv_file)    
    
    #    remove rows that have 'nan' in the 'values' column
    df_fps_theo = df_fps_theo[df_fps_theo['values'].notna()]
    print('FPS------')
    print(df_fps_theo)
    print('TOP1------')
    #    rename columns
    df_fps_theo.columns=['hardw','net_prun_datatype','fps']
    print(df_top1_theo)
    #   Merge both Theoretical dataframes: fps + top1 
    df_fps_top1_theo = pd.merge(df_top1_theo, df_fps_theo, on='net_prun_datatype', how='outer')
    #  change column order
    df_fps_top1_theo = df_fps_top1_theo[['net_prun_datatype', 'hardw', 'top1', 'fps']]
    #  change column names
    df_fps_top1_theo.columns = ['net_prun_datatype', 'hardw_datatype', 'top1', 'fps-comp']
    
    #Notes: 1. make sure everything in 'net_prun_datatype' column has network + pruning + datatype. If not it will fail
    df_fps_top1_theo = replace_data_df(df_=df_fps_top1_theo, column= 'net_prun_datatype', 
                                       list_tuples_data_to_replace= [('GoogLeNetv1','GoogLeNetv1_100%'),
('MobileNetv1','MobileNetv1_100%'),('GoogleNetv1','GoogleNetv1_100%'), ('EfficientNet_S','EfficientNet-S_100%'), 
('EfficientNet_M','EfficientNet-M_100%'), ('EfficientNet_L','EfficientNet-L_100%'), ('%','')])
    #  now that we have: net_prun_datatype | hardw_datatype | top1 | fps-comp
    return df_fps_top1_theo

#-----------------------------------------------------

def get_overlapped_pareto(net_keyword: str):
    """
    Main method to get the overlapped pareto plots.
    What it does: Get top1 acc. -> Get fps correpsonding to previous acc. -> Get measured pareto -> join them -> identify pairs -> plot it
        1.
   
    Parameters
    ----------
    net_keyword: str
        This string should contain the Classification type needed by user. 
        It is not case sensistive.
        Eg.: imagenet, mnist or cifar-10
    Returns
    -------
    Heatmap Chart: altair.vegalite.v4.api.Chart
        This is an Altair/Vega-Lite Heatmap chart. 
        It returns the overlapped pareto plot (theoretical + measured + 2 pareto lines(theoretical+measured)).       
    """
    # 1. Get the CNNs Accuracies table (Theoretical_Analysis/CNNs and their accuracy...) that only has the top1 accuracy and process it.
    #   theoretical top1
    df_top1_theo = process_theo_top1(csv_theor_accuracies ='data/cnn_topologies_accuracy.csv')
    #now we have: |top1 | net_prun_datatype| 
   
    # 2. Now we need Theoretical FPS-COMP to match with that Theoretical TOP1
    # 3. We need to get the above mentioned Theoretical FPS-COMP from the Heatmaps- Performance Predictions and merge them
    # depending on the user input this is retrieved for the desired Classification Task
    if re.search(net_keyword, 'imagenet', re.IGNORECASE):
        df_fps_top1_theo = process_theo_fps(df_top1_theo= df_top1_theo, csv_file="data/performance_predictions_imagenet_mnist_cifar.csv")
        df_measured = process_measured_data(csv_filepath= 'data/cleaned_csv/experimental_data_imagenet.csv')
    elif re.search(net_keyword, 'mnist', re.IGNORECASE):
        df_fps_top1_theo = process_theo_fps(df_top1_theo= df_top1_theo, csv_files=["data/performance_predictions_imagenet_mnist_cifar.csv"])
        df_measured = process_measured_data(csv_filepath= 'data/cleaned_csv/experimental_data_mnist.csv')
    elif re.search(net_keyword, 'cifar-10', re.IGNORECASE):
        df_fps_top1_theo = process_theo_fps(df_top1_theo= df_top1_theo, csv_files=["data/performance_predictions_imagenet_mnist_cifar.csv"])
        df_measured = process_measured_data(csv_filepath= 'data/cleaned_csv/experimental_data_cifar.csv')
    

    df_fps_top1_theo = select_cnn_match_theo_for_measured(df_theo= df_fps_top1_theo, net_prun_datatype = 'net_prun_datatype')
    # now we have: |hardw_datatype_net_prun | hardw | network | fps-comp | top1 | type|
    
    #  concatenate both measured with theoretical to get the overlapped pareto
    overlapped_pareto = pd.concat([df_fps_top1_theo, df_measured])
    # now we have everything together and matched

    #put everything to lowercase
    overlapped_pareto.hardw_datatype_net_prun = overlapped_pareto.hardw_datatype_net_prun.str.casefold() 
    #organize by column alpabetically
    overlapped_pareto= overlapped_pareto.sort_values(by='net_prun_datatype')

    # identify all pairs and create a special column for them 
    overlapped_pareto = identify_pairs_nonpairs(df=overlapped_pareto, column='hardw_datatype_net_prun')
    
    #overlapped_pareto = overlapped_pareto.drop(overlapped_pareto[overlapped_pareto.type=='predicted_no_match|measured_no_match'].index)
    
    # now we have: |hardw_datatype_net_prun | hardw | network | fps-comp | top1 | type | color|
    #return overlapped_pareto
    #plot it
    return plot_it_now(df= overlapped_pareto, xcol= 'fps-comp', ycol= 'top1', groupcol= 'color', title='Overlapped Pareto Plots Theoretical + Measured for' + ' ' + net_keyword.upper())    


In [None]:
#hide_input
#%run scripts/overlapped_pareto.py
df=get_overlapped_pareto('imagenet') 
df#.loc[df.color=='measured_no_match']

In [None]:
import altair as alt
from vega_datasets import data

alt.Chart(data.cars.url).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
    color=alt.Color('Name:N', legend=alt.Legend(columns=5))
).properties(
    # Adjust chart width and height to match size of legend
    width=200,
    height=200
)

In [None]:
import altair as alt
from vega_datasets import data

source = data.unemployment_across_industries.url

selection = alt.selection_multi(fields=['series'], bind='legend')

alt.Chart(source).mark_area().encode(
    alt.X('yearmonth(date):T', axis=alt.Axis(domain=False, format='%Y', tickSize=0)),
    alt.Y('sum(count):Q', stack='center', axis=None),
    alt.Color('series:N', scale=alt.Scale(scheme='category20b')),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_selection(
selection
)

# Efficiency Plot

In order to understand the gap between the theoretical predictions and what was measured, an efficiency bar-chart was created. The size of the bar reflects the absolute performance, whereby all theoretical predictions are shown in red, theoretical peak performance in blue, and all measured datapoints in orange. The orange bars are annotated with the efficiency achieved as a percentage of the predicted performance. Please note the logarithmic y-axis scale. The theoretical predictions take memory bottlenecks into account, as such measured performance can actually exceed the predicted result, in which case the percentage can be above 100%.

In [None]:
#hide_input
%run scripts/overlapped_pareto.py
imagenet_efficiency_df = get_peak_perf_gops_df(df_=imagenet_df_tmp) #takes the imagenet df and fills it with data for the 3rd bar - Theoretical Peak Performance
efficiency_plot(net_keyword= 'imagenet', df_theo_peak_compute=imagenet_efficiency_df, title='Efficiency Plots for ImageNet')

In [None]:
#hide
imagenet_df.to_csv('data/cleaned_csv/experimental_data_imagenet.csv', index = False)
imagenet_df