# **PamCrash dataset analysis**

Renault’s engineering teams need to perform thousands of crash simulations each day to validate the safety of the next generations of vehicles. These simulations run on DIRE’s HPC platform, which comprises about 100,000 CPU cores and a specialized server infrastructure. Each simulation takes multiple hours, sometimes days, on multiple servers used in parallel. A typical crash simulation takes about 8 hours on 180 CPU cores (5 servers).

The annual cost of simulation is in the millions of euros, including infrastructure costs (servers) and license costs. It is thus crucial to optimize the way crash simulations are executed. 

To limit the cost, one possibility would be to slow down simulations when there is no business impact. The scenario would be the following: if a simulation would complete outside engineering business hours, it could be slowed down to complete right before the next work session (e.g. the day after or on Monday morning).

Slowing down simulations saves money the same way driving slower saves gasoline over the same distance. Our usual cost metric is the number of cores multiplied by the number of hours they are used, noted core.hours. The faster a given simulation runs, the more core.hours it consumes, and thus the more it costs.
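
As a rough illustration of the core.hours metric, the sketch below compares the typical run quoted earlier (8 hours on 180 cores) with a slowed-down run. The slowed-down figures (2 servers, 16 hours) are invented for the example and are not measurements; the helper name `core_hours` is also ours.

```python
def core_hours(cores: int, hours: float) -> float:
    """Cost metric used in this notebook: allocated cores x wall-clock hours."""
    return cores * hours

# Typical run quoted above: 8 hours on 180 cores (5 servers, i.e. 36 cores each).
typical = core_hours(180, 8)   # 1440 core.hours

# HYPOTHETICAL slowed-down run: 2 servers (72 cores) taking 16 hours.
# The 16 h duration is made up for illustration; real scaling must be measured.
slowed = core_hours(72, 16)    # 1152 core.hours

saving = 1 - slowed / typical
print(f"typical: {typical} core.hours, slowed: {slowed} core.hours, saving: {saving:.0%}")
```

Under these assumed numbers the slower run finishes 8 hours later but uses 20% fewer core.hours, which is the trade-off the predictor is meant to quantify.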

Automatically slowing down simulations running outside business hours should have no impact on the business. One prerequisite is to accurately predict the time a simulation would take depending on the number of allocated servers, so that we can choose the number of cores that lets the simulation finish just in time, without making the business wait for the result. A simulation returned too early leads to an unnecessary extra cost; a simulation arriving too late leads to a loss of productivity for the engineer or the team waiting for the result.

The objective of this challenge is to build a reliable and robust predictor that can be used in an algorithm for deciding the number of cores to be allocated to a simulation. The construction of the decision algorithm based on this predictor is outside the scope of this study. 

# New section

### **Linking the notebook to the drive to access the dataset**



In [None]:
# Use this code if running on GCP's Labs for Renault

In [None]:
%%bigquery df
SELECT 
*
FROM
  `challenge.training_data`

In [None]:
# Use this code if running on Google Colaboratory instead, or your own Jupyter Notebook
# The result is named df so that the cells below work unchanged
#import pandas as pd
#df = pd.read_csv("https://drive.google.com/u/0/uc?id=1nYfF1tsQtEo0YBAlu0iE5qohp7kAKe2c&export=download", sep=";")
#df = df.rename(columns={"TZC FINAL": "TZC_FINAL", "MPLINK+NTNU": "MPLINK_NTNU"}, errors="raise")

### **Loading the libraries required for the analysis and plots**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import operator
from sklearn.preprocessing import LabelEncoder
%matplotlib inline

### **Creating some useful functions**

- **filter**: filter the dataframe by column value.

- **discretize**: transforms the values of a column from continuous to categorical. `discretize([10, 11, 15, 19, 21], [15, 20])` returns `[0, 0, 1, 1, 2]` (0 for x<15, 1 for 15<=x<20, 2 for x>=20).

- **plot_pie**: draws a pie chart of the distribution of values in a given column of a dataset.

In [None]:
def filter_aux(first, op, last):
    ops = {'>': operator.gt,
           '<': operator.lt,
           '>=': operator.ge,
           '<=': operator.le,
           '==': operator.eq}
    return ops[op](first, last)

def filter(data, category, op, value):
    m = filter_aux(data[category], op, value)
    df = data[m]
    return df

def discretize(data, bins):
    data = np.asarray(data)
    data = np.digitize(data, bins)
    return data

def plot_pie(data, category, title, lb=None, ax1=None):
  # Distinct values of the column; each one becomes a wedge
  labels = list(set(data[category]))
  if lb is None:
    lb = labels
  # Share of each value, in percent
  sizes = [len(filter(data, category, "==", value)) / len(data[category]) * 100
           for value in labels]
  # Pull the largest wedge slightly out of the pie
  explode = np.zeros(len(labels))
  explode[sizes.index(max(sizes))] = 0.1
  # Draw on the axes passed by the caller, or create a new figure
  if ax1 is None:
    fig1, ax1 = plt.subplots()
  wedges, texts = ax1.pie(sizes, explode=tuple(explode), textprops=dict(color="w"))
  ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
  ax1.legend(wedges, lb,
            title=title,
            loc="center left",
            bbox_to_anchor=(1, 0, 0.5, 1))

  plt.setp(texts, size=8, weight="bold")

  plt.show()
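
A quick sanity check of the helpers on toy data can be sketched as follows. To keep the snippet self-contained, it calls the underlying pandas and NumPy operations directly (boolean masking and `np.digitize`), which is what `filter` and `discretize` wrap; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up for illustration)
toy = pd.DataFrame({"NBSERVERS": [5, 5, 10, 2], "ELAPSEDTIME": [10, 11, 15, 21]})

# Equivalent of filter(toy, "NBSERVERS", "==", 5): keep the rows matching a condition
five_servers = toy[toy["NBSERVERS"] == 5]

# Equivalent of discretize([10, 11, 15, 19, 21], [15, 20])
codes = np.digitize(np.asarray([10, 11, 15, 19, 21]), [15, 20])

print(len(five_servers))   # 2
print(codes.tolist())      # [0, 0, 1, 1, 2]
```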

### **What does our dataset look like?**


#### **Columns description**

We load the dataset and use the pandas `head()` method to visualize a sample of the data. The following is a brief explanation of each column:

**JOBID**: The ID of each simulation (for information only).

**PERFORMANCE**: Type of crash performance (front crash, lateral, pedestrian, etc).

**PRECISION**: Numerical precision (single or double). Impacts the performance of the simulation.

**RUNEND**: The simulated duration (the physical time covered by the simulation), in milliseconds.

**TIMESTEP**: A time step given by the user in ms. The maximum number of iterations is equal to RUNEND/TIMESTEP but the simulation can be stopped earlier.

**DATACHECK_TIME**: Execution time (in seconds) of the pre-computing phase, allowing us to check that the parameters are valid.

**NBNODES**: Total number of nodes in the model. The 1D, 2D and 3D elements are composed of such nodes.

**NBELEM2D**: Number of 2D elements in the model. Each element is composed of several nodes.

**ELAPSEDTIME**: Total time of the simulation, in seconds. This is the variable we're trying to predict.

**TZC**: Output column. Average time required per node per iteration. This column is given for advanced usage only, because predicting TZC may help predict ELAPSEDTIME.
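
The relation between RUNEND and TIMESTEP stated above can be turned into a derived feature. The sketch below uses made-up values, and the column name `MAX_ITER` is our own; it is not part of the dataset.

```python
import pandas as pd

# Made-up values, both in milliseconds as in the column descriptions
runs = pd.DataFrame({"RUNEND": [120.0, 150.0], "TIMESTEP": [0.001, 0.0005]})

# Upper bound on the number of iterations; the simulation may stop earlier
runs["MAX_ITER"] = runs["RUNEND"] / runs["TIMESTEP"]

print(runs)
```

Such a column could be a useful input for a model, since it combines two user-provided parameters into a single quantity that bounds the amount of work.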

In [None]:
df.head()

In [None]:
# let's analyse the dataset structure ("object" means "string")
df.info()

#### **A bit of statistical analysis on the data**


**Removing some useless columns**

The ID of the simulation and the day and hour at which it was started shouldn't have any impact on its run time, so we remove them. "TZC_FINAL" is an output column, and we don't use it in this baseline analysis.

In [None]:
df = df.drop(["JOBID", "DAY", "HOUR", "TZC_FINAL"], axis=1)

**Understanding the dataset**

We compute summary statistics for each column, such as the mean, standard deviation, frequency, etc.

If a statistic isn't compatible with the column type, a "NaN" value will be displayed (such as mean or std for a categorical variable).


In [None]:
df.describe(include="all")
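
The NaN behaviour described above can be seen on a tiny example. The values below are made up; only the column names are borrowed from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "PRECISION": ["single", "double", "single"],   # categorical (object) column
    "NBSERVERS": [5, 5, 10],                       # numerical column
})

desc = toy.describe(include="all")

# "mean" is undefined for a categorical column, "unique" for a numerical one
print(desc.loc[["unique", "mean"]])
```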

**Keeping only the HPC3 Cluster**

We'll focus only on the HPC3 cluster in the training, since this is the cluster on which the number of servers varies.

In [None]:
HPC3 = df[df["CLUSTER"] == "HPC3"].drop(["CLUSTER"], axis=1).copy()

**Displaying the frequency of each value for selected variables**

We display the frequency of each value for some variables that have a big impact on the ELAPSEDTIME variable.

In [None]:
HPC3['PERFORMANCE'].value_counts()

In [None]:
HPC3['PRECISION'].value_counts()

In [None]:
HPC3['VERSION'].value_counts()

In [None]:
HPC3['NBSERVERS'].value_counts()
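
When proportions are easier to read than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up server counts for eight simulations
servers = pd.Series([5, 5, 5, 10, 2, 5, 10, 5], name="NBSERVERS")

# Fraction of simulations per server count instead of absolute counts
shares = servers.value_counts(normalize=True)
print(shares)
```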

### **Graphical displays to build an intuition about the data**




# New section

In [None]:
Pie_HPC3 = HPC3.copy()
# Append the cardinality of each category to its label ("<value> : <count>").
# Used only to display the pie charts
perf = {key: key + " : " + str(n) for key, n in HPC3["PERFORMANCE"].value_counts().items()}
nbserv = {key: str(key) + " : " + str(n) for key, n in HPC3["NBSERVERS"].value_counts().items()}
Pie_HPC3 = Pie_HPC3.replace({"PERFORMANCE": perf})
Pie_HPC3 = Pie_HPC3.replace({"NBSERVERS": nbserv})

#### **Data Distribution**

We use a pie chart to display the proportion of each category of the PERFORMANCE and NBSERVERS variables.

In [None]:
plot_pie(Pie_HPC3, "PERFORMANCE", "Performance types : Occurrences")
# TODO: give the cardinalities; same for the following charts

In [None]:
plot_pie(Pie_HPC3, "NBSERVERS", "Servers : Occurrences")

**Discretization of continuous variables**

We transform continuous data into discretized/categorical data, so that a pie chart can show the proportion of each category.

In [None]:
bins = [5*3600, 7*3600, 10*3600, 12*3600, 18*3600, 24*3600]  # bin edges in seconds
Pie_HPC3["ELAPSEDTIME_disc"] = discretize(HPC3["ELAPSEDTIME"], bins)
tab = ["<5h", "<7h", "<10h", "<12h", "<18h", "<24h", ">24h"]
# Count each bin once; .get avoids a KeyError when a bin is empty
counts = Pie_HPC3["ELAPSEDTIME_disc"].value_counts()
elap = {i: tab[i] + " : " + str(counts.get(i, 0)) for i in range(len(tab))}
Pie_HPC3 = Pie_HPC3.replace({"ELAPSEDTIME_disc": elap})
plot_pie(Pie_HPC3, "ELAPSEDTIME_disc", "ELAPSEDTIME : Occurrences")
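
As an alternative to the `np.digitize` approach above, `pd.cut` can bin and label in one step. This is a sketch on made-up elapsed times, not a replacement for the cell above; with `right=False` the intervals are closed on the left, which matches the `np.digitize` default.

```python
import numpy as np
import pandas as pd

HOUR = 3600
edges = [5 * HOUR, 7 * HOUR, 10 * HOUR, 12 * HOUR, 18 * HOUR, 24 * HOUR]
labels = ["<5h", "<7h", "<10h", "<12h", "<18h", "<24h", ">24h"]

# Made-up elapsed times in seconds
elapsed = pd.Series([2 * HOUR, 5 * HOUR, 8 * HOUR, 30 * HOUR])

# right=False gives intervals [low, high), the same convention as np.digitize
binned = pd.cut(elapsed, bins=[-np.inf, *edges, np.inf], labels=labels, right=False)

# Counts per bin, in bin order
print(binned.value_counts(sort=False))
```

Note that a value exactly on an edge (5 hours here) falls in the next bin, so the "<5h" label is strict.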

#### **Correlation between features**

A heatmap is a type of data visualization that shows the correlation between each pair of variables. In other words, we want to see how strongly each variable is correlated with the output variable (ELAPSEDTIME), and whether two variables carry the same information (in which case getting rid of one of them won't affect the results).

**Encoding categorical data**

We need to transform categorical data into numerical data so that the heatmap can use it. For example, we can represent the set of values [Yes, No] as [1, 0].

In [None]:
numerical = list(df.describe().columns)
categorical = [col for col in df.columns if col not in numerical]
for column in categorical:
  df[column] = LabelEncoder().fit_transform(df[column])
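
The [Yes, No] example above can be reproduced with pandas alone. The sketch below uses pandas categorical codes rather than `LabelEncoder`; for string columns both assign integers following the sorted order of the distinct values, so "No" becomes 0 and "Yes" becomes 1.

```python
import pandas as pd

answers = pd.Series(["Yes", "No", "Yes", "No"])

# Categories are sorted alphabetically, then numbered from 0
as_category = answers.astype("category")
codes = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())   # [1, 0, 1, 0]
print(mapping)          # {0: 'No', 1: 'Yes'}
```

Keep in mind that these integers impose an arbitrary order on categories that have none, so correlations computed on encoded columns with more than two values should be read with caution.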

**Visualizing the HeatMap and correlation table**

The color of each cell indicates the degree of correlation. Red is for positive correlation (variable2 tends to increase when variable1 increases), blue is for negative correlation (variable2 tends to decrease when variable1 increases), whilst white is neutral (no linear correlation between variable1 and variable2).


In [None]:
correl = df.corr(method = 'pearson')
sns.set(rc={'figure.facecolor':'white'})
fig, ax = plt.subplots(figsize=(10,10)) 
correl_final = sns.heatmap(correl, vmin = -1, vmax = 1, center = 0, cmap = "RdBu_r", square = True, ax=ax)
correl_final.set_title('Correlation between features', fontsize = 25, loc = 'left')

In [None]:
# Display the Heatmap's rounded values
round(correl,2)
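
To read the correlation table more systematically, one option is to rank features by their correlation with the target and to list feature pairs that look redundant. The sketch below runs on synthetic data (not the challenge dataset), and the 0.95 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only
rng = np.random.default_rng(0)
nodes = rng.uniform(1e6, 5e6, size=200)
toy = pd.DataFrame({
    "NBNODES": nodes,
    "NBELEM2D": 0.9 * nodes + rng.normal(0, 1e4, size=200),  # nearly redundant with NBNODES
    "NBSERVERS": rng.choice([2, 5, 10], size=200),
})
toy["ELAPSEDTIME"] = toy["NBNODES"] / toy["NBSERVERS"] + rng.normal(0, 1e4, size=200)

corr = toy.corr(method="pearson")

# Features ranked by absolute correlation with the target
ranking = corr["ELAPSEDTIME"].drop("ELAPSEDTIME").abs().sort_values(ascending=False)

# Feature pairs above a (hypothetical) redundancy threshold.
# Only the upper triangle is kept so each pair appears once.
THRESHOLD = 0.95
features = corr.drop(index="ELAPSEDTIME", columns="ELAPSEDTIME")
upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
redundant = upper.stack()
redundant = redundant[redundant.abs() > THRESHOLD]

print(ranking)
print(redundant)
```

On the real dataset, the same two steps could be applied to `correl`; a pair flagged as redundant is only a candidate for removal and is worth checking against domain knowledge first.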