# Data analysis for CSP 11C

Available results: `Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB TetraTech1 TetraTech2 UT-CSEE`.

## Computing the SPE11 distance

Run the data analysis and compute the SPE11 distance between all selected results.
The `-g` option selects the groups whose sparse data are used as submitted by the
respective participants; `-f` specifies the folder containing all results.
Alternatively, for selected quantities, the `-c` options use values
post-processed from the submitted dense data that reside in the folder
specified by `-t`: `-cAB` for Boxes A and B (imm/mob data),
`-cSealA` and `-cSealB` for the seal data in Boxes A and B, respectively, and `-cC` for Box C.

The results can be displayed and stored in the output folder, defined by the `-o` option. This includes a distance matrix which will be used for further detailed analysis below.

In [None]:
%matplotlib inline
%run ../analysis/compute_spe11_distance.py \
-v spe11c \
-f ../shared_folder/data/spe11c \
-t ../shared_folder/evaluation \
-o output/spe11c \
-g Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 \
   IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 \
   Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 \
   SLB TetraTech1 TetraTech2 UT-CSEE \
-cAB Calgary CTC-CNE OpenGoSim1 OpenGoSim2 \
-cSealA Calgary PFLOTRAN \
-cSealB Calgary PFLOTRAN \
-cC Calgary CAU-Kiel CTC-CNE Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB \
-cBCO2 PFLOTRAN

## Inspecting the distance matrix
The above computation stored the distance matrix in the file `output/spe11c/spe11c_distance_matrix.csv`. We can load and display it.

In [None]:
import pandas as pd
df = pd.read_csv('output/spe11c/spe11c_distance_matrix.csv', index_col=0)

# Uncomment for a quick look at the distance matrix
#df

The script `analyze_distance_matrix` offers further tools for inspection, including visualization and the determination of minimal values.

We start by visualizing the full distance matrix. To do so, pass the option `-option show-distance-matrix`.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option show-distance-matrix \
-v spe11c \
-o output/spe11c

The corresponding image of the distance matrix is written to the file `output/spe11c/spe11c_distance_matrix.png`.

By specifying individual groups through `-g`, one can extract a subset of the distance matrix for closer inspection. NOTE: The next code block overwrites the image.

In [None]:
#%run ../analysis/analyze_distance_matrix.py \
#-option show-distance-matrix \
#-v spe11c \
#-o output/spe11c \
#-g GEOS2 OpenGoSim1 OPM4 SINTEF3

## Extracting single distances
We can extract single pairwise distances through the option `-option print-distances` together with a selection of groups through `-g`.

### Example analysis: Different groups using the same simulator / Same groups using different simulators
For example, we can inspect the use of OPM Flow and SLB IX by different groups, in contrast to the same group using different simulators.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option print-distances \
-v spe11c \
-o output/spe11c \
-g OPM1 CAU-Kiel \
-g CTC-CNE SLB \
-g TetraTech1 TetraTech2
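
The lookup behind such pairwise queries can be sketched as follows. This is a minimal illustration, not the implementation in `analyze_distance_matrix.py`: it assumes the CSV has group names in both the header row and the first column, and it uses a toy matrix with hypothetical group names `G1` to `G3` and made-up values.

```python
import csv
import io

# Toy stand-in for the distance matrix file; layout, names and values are assumed.
TOY_CSV = """,G1,G2,G3
G1,0.0,0.4,1.2
G2,0.4,0.0,0.9
G3,1.2,0.9,0.0
"""


def read_distance_matrix(text):
    """Parse a square CSV with group names as header row and first column."""
    rows = list(csv.reader(io.StringIO(text)))
    columns = rows[0][1:]
    return {
        row[0]: {col: float(val) for col, val in zip(columns, row[1:])}
        for row in rows[1:]
    }


toy_matrix = read_distance_matrix(TOY_CSV)

# Print the distance for each requested pair of groups.
for first, second in [("G1", "G2"), ("G2", "G3")]:
    print(f"{first} vs {second}: {toy_matrix[first][second]}")
```

With the DataFrame loaded earlier via `pd.read_csv(..., index_col=0)`, the same lookup would be `df.loc[first, second]`, provided the row and column labels are the group names.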

### Example analysis: Mesh refinement
We can also inspect the effect of grid refinement, as studied by individual groups.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option print-distances \
-v spe11c \
-o output/spe11c \
-g GEOS1 GEOS2 \
-g OpenGoSim1 OpenGoSim2 \
-g GEOS2 OpenGoSim2

## Medians and correlations
The SPE11 distance uses a scaling for each reporting quantity considered, essentially based on a median value of the distances for each type of quantity. The total SPE11 distance then aggregates the single distances, which raises the natural question of how well the single distances correlate with the global distance. The Pearson correlation coefficient (PCC) offers a quantitative measure. As part of the computation of the SPE11 distance, these statistics are recorded in the output folder as `spe11c_statistics.csv`.

In [None]:
import pandas as pd
df = pd.read_csv('output/spe11c/spe11c_statistics.csv')
df
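
For reference, the PCC between one single distance and the global distance can be computed as in the sketch below. The two lists are hypothetical values (one entry per pair of submissions) chosen only for illustration; how the analysis script collects these values is not shown here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: a single distance and the global distance per pair.
single_distances = [0.2, 0.5, 0.9, 1.4]
total_distances = [1.0, 2.0, 3.0, 5.0]

pcc = pearson(single_distances, total_distances)
print(f"PCC: {pcc:.3f}")
```

A value close to 1 indicates that the single distance varies almost linearly with the global distance; a value close to 0 indicates little linear relation.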

## Visualization of clustering
The distance matrix provides the means for linkage clustering, visualized by a dendrogram. Using the option `-option show-clustering`, the dendrogram is displayed and stored to file.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option show-clustering \
-v spe11c \
-o output/spe11c

# Finding the smallest mean distance (median result)
We define the median submission as the submission with the lowest mean distance to all other submissions. To find it, we call the script `analyze_distance_matrix` with the option `-option find-min-mean-distance`. The groups included in the analysis can be selected using `-g`.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option find-min-mean-distance \
-v spe11c \
-o output/spe11c \
-g Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 \
   IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 \
   Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB \
   TetraTech1 TetraTech2 UT-CSEE
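
The definition of the median submission can be sketched as follows. The matrix uses hypothetical group names and made-up values, and the function names are illustrative; the script itself may be organized differently.

```python
# Toy symmetric distance matrix; names and values are made up.
toy_matrix = {
    "G1": {"G1": 0.0, "G2": 1.0, "G3": 3.0},
    "G2": {"G1": 1.0, "G2": 0.0, "G3": 1.5},
    "G3": {"G1": 3.0, "G2": 1.5, "G3": 0.0},
}


def mean_distance_to_others(matrix, group, selection):
    """Mean distance from `group` to every other group in `selection`."""
    others = [g for g in selection if g != group]
    return sum(matrix[group][other] for other in others) / len(others)


def find_median_submission(matrix, selection):
    """Group in `selection` with the lowest mean distance to the others."""
    return min(
        selection,
        key=lambda g: mean_distance_to_others(matrix, g, selection),
    )


selection = ["G1", "G2", "G3"]
median_group = find_median_submission(toy_matrix, selection)
print(median_group, mean_distance_to_others(toy_matrix, median_group, selection))
```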

## Finding the smallest pair-wise distance
By using the option `-option find-min-distance`, we can search for the smallest distance between two distinct submissions among the provided groups. Again, we can restrict the analysis to a subset using `-g`.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option find-min-distance \
-v spe11c \
-o output/spe11c \
-g Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 \
   IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 \
   Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB \
   TetraTech1 TetraTech2 UT-CSEE
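
Conceptually, this search runs over all unordered pairs of distinct submissions. A minimal sketch, using a toy matrix with hypothetical names and made-up values:

```python
from itertools import combinations

# Toy symmetric distance matrix; names and values are made up.
toy_matrix = {
    "G1": {"G1": 0.0, "G2": 1.0, "G3": 3.0},
    "G2": {"G1": 1.0, "G2": 0.0, "G3": 1.5},
    "G3": {"G1": 3.0, "G2": 1.5, "G3": 0.0},
}


def find_min_distance(matrix, selection):
    """Return (distance, first, second) for the closest distinct pair."""
    return min(
        (matrix[a][b], a, b) for a, b in combinations(selection, 2)
    )


closest = find_min_distance(toy_matrix, ["G1", "G2", "G3"])
print(closest)
```

Using `combinations` skips the zero diagonal and visits each pair only once, which is sufficient for a symmetric matrix.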

## Finding the smallest group-wise distance
By adding multiple groups through repeated use of `-g`, one introduces a collection of subgroups. For the option `-option find-min-distance`, multiple subgroups make it possible to exclude comparisons within each subgroup. This way, one can analyze, e.g., the smallest distance between submissions from different groups.

In [None]:
%run ../analysis/analyze_distance_matrix.py \
-option find-min-distance \
-v spe11c \
-o output/spe11c \
-g Calgary \
-g CAU-Kiel \
-g CSIRO \
-g CTC-CNE \
-g GEOS1 GEOS2 \
-g IFPEN \
-g OpenGoSim1 OpenGoSim2 \
-g OPM1 OPM2 OPM3 OPM4 \
-g Pau-Inria \
-g PFLOTRAN \
-g SINTEF1 SINTEF2 SINTEF3 \
-g SLB \
-g TetraTech1 TetraTech2 \
-g UT-CSEE
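
The effect of subgroups can be sketched as follows: only pairs whose members come from different subgroups are compared. The names and values are hypothetical, and the function is an illustration rather than the script's actual code.

```python
from itertools import combinations

# Toy symmetric distance matrix; names and values are made up.
toy_matrix = {
    "G1": {"G1": 0.0, "G2": 0.2, "G3": 1.0, "G4": 2.0},
    "G2": {"G1": 0.2, "G2": 0.0, "G3": 0.7, "G4": 1.5},
    "G3": {"G1": 1.0, "G2": 0.7, "G3": 0.0, "G4": 0.9},
    "G4": {"G1": 2.0, "G2": 1.5, "G3": 0.9, "G4": 0.0},
}


def find_min_cross_group_distance(matrix, subgroups):
    """Closest pair whose two members belong to different subgroups."""
    best = None
    for left, right in combinations(subgroups, 2):
        for a in left:
            for b in right:
                candidate = (matrix[a][b], a, b)
                if best is None or candidate < best:
                    best = candidate
    return best


# G1 and G2 form one subgroup, so their small mutual distance is ignored.
subgroups = [["G1", "G2"], ["G3"], ["G4"]]
closest_cross = find_min_cross_group_distance(toy_matrix, subgroups)
print(closest_cross)
```

In this toy case the overall smallest distance (0.2 between `G1` and `G2`) is skipped because both belong to the same subgroup.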

# Variability analysis (all submissions)

We can compute the variability within a single group. For this, we specify the group through `-g`. For example, we can compute the overall variability.

In [None]:
%run ../analysis/variability_analysis.py \
-v spe11c \
-o output/spe11c \
-g Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 \
   IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 \
   Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB \
   TetraTech1 TetraTech2 UT-CSEE
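
To make the notion of group variability concrete, the sketch below assumes, purely for illustration, that variability is summarized as the mean of all pairwise distances within the group. The definition used by `variability_analysis.py` may differ; the names and values are hypothetical.

```python
from itertools import combinations
from statistics import mean

# Toy symmetric distance matrix; names and values are made up.
toy_matrix = {
    "G1": {"G1": 0.0, "G2": 1.0, "G3": 3.0},
    "G2": {"G1": 1.0, "G2": 0.0, "G3": 1.5},
    "G3": {"G1": 3.0, "G2": 1.5, "G3": 0.0},
}


def group_variability(matrix, group):
    """Assumed measure: mean pairwise distance among the group's members."""
    return mean(matrix[a][b] for a, b in combinations(group, 2))


overall = group_variability(toy_matrix, ["G1", "G2", "G3"])
subset = group_variability(toy_matrix, ["G1", "G2"])
print(f"overall: {overall:.3f}, subset: {subset:.3f}")
```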

Similarly, we can ask for the baseline variability. For this, we need to specify the baseline group as input. For SPE11C, these are `Calgary`, `CAU-Kiel`, `CSIRO`, `CTC-CNE`, `GEOS1`, `IFPEN`, `OpenGoSim1`, `OPM1`, `Pau-Inria`, `SLB`, `TetraTech1`, `TetraTech2`.

In [None]:
%run ../analysis/variability_analysis.py \
-v spe11c \
-o output/spe11c \
-g Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 IFPEN OpenGoSim1 OPM1 Pau-Inria \
SLB TetraTech1 TetraTech2

# Statistical analysis (comparison of variability)

## Baseline variability vs. overall variability

We can compute p-values for null hypotheses comparing the variability of two groups, quantifying the statistical significance. For example, we analyze whether the base case group has smaller variability than the overall variability. For this, we specify `-g-smaller` and `-g-greater`, here the base case group and all groups, respectively.

In [None]:
%run ../analysis/variability_analysis.py \
-v spe11c \
-o output/spe11c \
-g-smaller Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 IFPEN \
   OpenGoSim1 OPM1 Pau-Inria SLB TetraTech1 TetraTech2 \
-g-greater Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 GEOS2 \
   IFPEN OpenGoSim1 OpenGoSim2 OPM1 OPM2 OPM3 OPM4 \
   Pau-Inria PFLOTRAN SINTEF1 SINTEF2 SINTEF3 SLB \
   TetraTech1 TetraTech2 UT-CSEE

## Commercial vs. academic/open-source simulators
Similarly, we can compute the p-values for comparing the commercial and the academic/open-source groups against all base case groups.

The commercial group is given by: `CTC-CNE`, `SLB`, `TetraTech2`.

In [None]:
%run ../analysis/variability_analysis.py \
-v spe11c \
-o output/spe11c \
-g-smaller CTC-CNE SLB TetraTech2 \
-g-greater Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 IFPEN \
   OpenGoSim1 OPM1 Pau-Inria SLB TetraTech1 TetraTech2

The academic/open-source group is given by: `Calgary`, `CAU-Kiel`, `CSIRO`, `GEOS1`, `IFPEN`, `OpenGoSim1`, `OPM1`, `Pau-Inria`, `TetraTech1`.

In [None]:
%run ../analysis/variability_analysis.py \
-v spe11c \
-o output/spe11c \
-g-smaller Calgary CAU-Kiel CSIRO GEOS1 IFPEN OpenGoSim1 \
   OPM1 Pau-Inria TetraTech1 \
-g-greater Calgary CAU-Kiel CSIRO CTC-CNE GEOS1 \
   IFPEN OpenGoSim1 OPM1 Pau-Inria SLB TetraTech1 TetraTech2