# Dataset exploration

In this notebook we will get to know the dataset we will use throughout the tutorial series.

The dataset consists of inputs and outputs of a battery degradation model.
More specifically, it is a pseudo-two-dimensional (P2D) model configured to simulate the formation of the solid electrolyte interphase (SEI) in a battery, based on the reduction of the solvent near the surface of the negative electrode during charging.

The electrolyte considered in the model is a mixture of ethylene carbonate/ethyl methyl carbonate (EC/EMC) with LiPF$_6$ salt. Hence, we assume that the main product forming the SEI layer is Li$_2$CO$_3$, and that it is formed according to the reaction:

$$
\text{S} + 2\text{Li}^+ + 2e^- \rightarrow \text{P}
$$

where $\text{S}$ is the solvent species and $\text{P}$ is the product of the reaction between the solvent and the Li ions.
The growth of the SEI layer is assumed to be one-dimensional and to be controlled by the kinetics of the reaction occurring at the interphase.

The inputs and outputs are explored in more detail below.

## Dependencies

In [None]:
# imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Load dataset

We can load the dataset directly from the GitHub URL.
Alternatively, the dataset can be loaded from a local file.

In [None]:
# load parameter table
parameters_path = "https://raw.githubusercontent.com/BIG-MAP/sensitivity_analysis_tutorial/main/data/p2d_sei_parameters.csv"
# parameters_path = "./../data/p2d_sei_parameters.csv"  # local
pt = pd.read_csv(parameters_path, index_col=0)
# mark missing units with a dash (assign back instead of a chained inplace call)
pt["unit"] = pt["unit"].fillna("-")

# load dataset
dataset_path = "https://raw.githubusercontent.com/BIG-MAP/sensitivity_analysis_tutorial/main/data/p2d_sei_10k.csv"
# dataset_path = "./../data/p2d_sei_10k.csv"  # local
df = pd.read_csv(dataset_path, index_col=0)

## Parameter table

Let us first have a look at the parameter table that was used to generate the dataset.
The table shows the input parameters of the P2D model along with the selected input ranges (low, high) and nominal values.

In [None]:
pt

## Dataset

The dataset was generated by sampling the input parameters uniformly at random within the ranges given in the parameter table above.
Then the P2D model was used to label each row.
The outputs of the P2D model are stored in the last two columns of the dataset:

- `SEI_thickness(m)`: thickness of the solid electrolyte interphase (SEI).
- `Capacity loss (%)`: loss of capacity due to SEI formation.

In this analysis we will primarily focus on `SEI_thickness(m)` and optionally on `Capacity loss (%)`.
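The sampling step described at the start of this section can be sketched as follows. This is a minimal illustration, not the script that generated the dataset: the toy parameter table, its parameter names, the `low`/`high` column names, and the random seed are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy parameter table (names and ranges are made up for illustration)
toy_pt = pd.DataFrame(
    {"low": [0.1, 1e-10, 280.0], "high": [0.9, 1e-8, 320.0]},
    index=["param_a", "param_b", "param_c"],
)

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n_samples = 1000

# Draw every parameter independently and uniformly between its low and high bound.
# The bounds broadcast across the rows of the (n_samples, n_parameters) array.
samples = rng.uniform(
    toy_pt["low"].to_numpy(),
    toy_pt["high"].to_numpy(),
    size=(n_samples, len(toy_pt)),
)
toy_inputs = pd.DataFrame(samples, columns=toy_pt.index)
```

Each row of `toy_inputs` would then be passed to the model to compute the corresponding outputs.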

In [None]:
# show dataset statistics
df.describe()

In [None]:
# show first rows of the dataset
df.head()

As you might notice, the first row in the dataset corresponds to the nominal values given in the parameter table.

## Data visualisation

Let us explore the dataset with some visualisations. 

In [None]:
# plot histograms
_ = df.hist(figsize=(20,20))

In [None]:
# plot scatter matrix
n = 1000
_ = pd.plotting.scatter_matrix(df.iloc[:n], figsize=(20,20))

We plot only the first `n = 1000` points in the scatter matrix above so as not to crowd the visualisation. Feel free to try changing this number.

Since the inputs are sampled uniformly at random, you should see that they cover the entire input range, as defined in the parameter table, and that the input parameters are not correlated with each other.

Perhaps you can also spot some interesting correlations or patterns between the inputs and outputs? It might not be super clear from this figure, but you can try to make a note of the input variables that look interesting to you and follow up on them later in our analysis of this dataset.
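The visual impression that the inputs are uncorrelated can also be checked numerically. The sketch below uses a synthetic stand-in for the input columns rather than the tutorial dataset, and `max_abs_offdiag_corr` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(inputs: pd.DataFrame) -> float:
    """Largest absolute pairwise Pearson correlation between distinct columns."""
    corr = inputs.corr().to_numpy()
    # Mask out the diagonal, which is always 1
    off_diag = corr[~np.eye(len(corr), dtype=bool)]
    return float(np.abs(off_diag).max())


# Synthetic stand-in: four independent, uniformly sampled inputs
rng = np.random.default_rng(0)
toy_inputs = pd.DataFrame(rng.uniform(size=(10_000, 4)), columns=list("abcd"))

max_corr = max_abs_offdiag_corr(toy_inputs)
print(f"largest absolute input-input correlation: {max_corr:.3f}")
```

Since the outputs are stored in the last two columns, the same check could be run on the real inputs with `max_abs_offdiag_corr(df.iloc[:, :-2])`. For independently sampled inputs the value should be close to zero.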

## Outputs

Let us have a closer look at the two outputs of interest: `SEI_thickness(m)` and `Capacity loss (%)`.

In [None]:
_ = df[["SEI_thickness(m)", "Capacity loss (%)"]].hist(bins=50, figsize=(10, 4))

Both of these outputs are strictly positive and have long right tails.
In our analysis with machine learning models, it might be useful to consider the log-transformed outputs instead, to account for these properties.

In [None]:
_ = df[["SEI_thickness(m)", "Capacity loss (%)"]].transform(np.log).add_prefix("log ").hist(bins=50, figsize=(10, 4))

After applying the log transformation, the outputs are on an unbounded continuous scale, which will be simpler to model. 
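The effect of the log transformation on a long tail can be quantified with the sample skewness. The sketch below is only illustrative: it uses a synthetic log-normal sample rather than the tutorial dataset, and `sample_skewness` is a hypothetical helper written for this example.

```python
import numpy as np


def sample_skewness(x) -> float:
    """Skewness as the mean cubed z-score (independent of the scale of x)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z**3))


# Synthetic strictly positive, long-tailed sample (not the tutorial data)
rng = np.random.default_rng(0)
toy_output = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_raw = sample_skewness(toy_output)
skew_log = sample_skewness(np.log(toy_output))
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The same helper can be applied to `df["SEI_thickness(m)"]` and to its logarithm. A skewness close to zero after the transformation would indicate a roughly symmetric distribution.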

We can also look at the relationship between the two outputs of interest:

In [None]:
plt.figure(figsize=(5,5))
plt.plot(df["SEI_thickness(m)"], df["Capacity loss (%)"], ".", alpha=.5)
plt.xlabel("SEI_thickness (m)")
plt.ylabel("Capacity loss (%)")
plt.grid()
plt.show()

An increase in `SEI_thickness(m)` appears to correlate with an increase in `Capacity loss (%)`.
However, other factors might also lead to high `Capacity loss (%)`.
Perhaps that is something we can also see in the sensitivity analysis.
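One way to put a number on the relationship seen in the scatter plot is a rank correlation, which does not assume a linear relationship and is less affected by the long tails. The sketch below computes the Spearman correlation as the Pearson correlation of the ranks. The two toy series are synthetic stand-ins for the outputs, and `spearman_corr` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd


def spearman_corr(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(x.rank().corr(y.rank()))


rng = np.random.default_rng(0)
# Synthetic first output: strictly positive with a long tail
toy_thickness = pd.Series(rng.lognormal(size=5_000))
# Synthetic second output: grows with the first one, scaled by an unrelated random factor
toy_loss = toy_thickness * pd.Series(rng.lognormal(sigma=0.5, size=5_000))

rho = spearman_corr(toy_thickness, toy_loss)
print(f"Spearman correlation: {rho:.2f}")
```

On the real data, the call would be `spearman_corr(df["SEI_thickness(m)"], df["Capacity loss (%)"])`. A value well below 1 would be consistent with other factors also contributing to the capacity loss.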

## Additional exploration

If there is anything else you are curious to know about the dataset, go ahead and create your own plots and statistics below (or wherever you like). 
Do not forget to save a copy of the notebook if you want to keep your changes. 

In [None]:
# my additional data exploration
