# RV and RV2 coefficients on sensory and fluorescence data

This notebook illustrates how to use the **hoggorm** package to compute the RV and RV2 matrix correlation coefficients between two multivariate data sets measured on the same objects. The two data sets used here are fluorescence measurements and sensory profiles of ten cheese products.

---

### Import packages and prepare data

First import **hoggorm** for analysis of the data and **hoggormPlot** for plotting of analysis results. We'll also import **pandas** so that we can read the data into a data frame. **numpy** is needed for checking the dimensions of the data.

In [2]:
import hoggorm as ho
import hoggormplot as hop
import pandas as pd
import numpy as np

Next, load the data that we are going to analyse using **hoggorm**. After the data has been loaded into the pandas data frame, we'll display it in the notebook.

In [3]:
# Load fluorescence data
X_df = pd.read_csv('cheese_fluorescence.txt', index_col=0, sep='\t')
X_df

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292
Pr 1,19222.109,19937.834,20491.777,20994.0,21427.5,21915.891,22273.834,22750.279,23215.609,23497.221,...,1338.0557,1311.9445,1275.1666,1235.7777,1204.6666,1184.9445,1140.5,1109.8888,1099.6666,1070.5
Pr 2,18965.945,19613.334,20157.277,20661.557,21167.334,21554.057,22031.391,22451.889,22915.334,23311.611,...,1244.5555,1217.1666,1183.9445,1156.5,1130.0555,1084.0,1066.5,1039.9445,1018.5,992.083313
Pr 3,19698.221,20438.279,21124.721,21740.666,22200.445,22709.725,23222.111,23646.225,24047.389,24519.111,...,1409.5,1366.9445,1319.8888,1289.7778,1258.2223,1235.1666,1200.611,1173.2778,1126.5557,1097.25
Pr 4,20037.334,20841.779,21510.889,22096.443,22605.889,23077.834,23547.725,23974.445,24490.889,24896.945,...,1374.5,1332.3334,1287.5,1252.9445,1228.8334,1195.9443,1159.1666,1153.6112,1117.2223,1088.3334
Pr 5,19874.889,20561.834,21248.5,21780.889,22328.834,22812.057,23266.111,23723.334,24171.221,24601.943,...,1329.0,1291.9445,1256.7778,1226.611,1209.7777,1169.8888,1144.5555,1123.3334,1084.8888,1081.5
Pr 6,19529.391,20157.834,20847.5,21308.111,21716.443,22165.775,22583.166,22993.779,23520.779,24015.221,...,1737.3888,1696.5,1635.5,1580.3334,1556.8334,1501.2222,1463.5555,1419.2778,1365.3888,1343.4166
Pr 7,18795.582,19485.582,20139.584,20644.668,21013.668,21480.668,21873.666,22302.418,22662.5,23097.0,...,1323.3333,1286.9167,1261.0,1235.0833,1190.0833,1174.6667,1129.1667,1095.4166,1070.4166,1049.5
Pr 8,20052.943,20839.445,21569.221,22150.221,22662.389,23160.389,23589.943,24117.5,24484.334,24971.666,...,1140.2778,1113.1112,1075.8334,1055.7778,1037.1112,1025.7778,986.277832,969.388855,944.944397,936.083313
Pr 9,19001.391,19709.943,20368.443,20939.111,21383.111,21879.111,22335.221,22758.834,23213.443,23688.891,...,1119.1666,1076.7777,1045.3888,1033.1112,1021.3333,994.222229,962.111084,943.0,920.166687,899.083313
Pr 10,20602.834,21406.389,22144.611,22775.0,23407.443,23940.609,24486.111,24976.275,25480.779,25966.279,...,1248.2777,1226.7778,1195.0,1169.5,1135.9445,1120.8888,1069.5555,1062.8334,1034.7222,1016.75


In [4]:
# Load sensory data
Y_df = pd.read_csv('cheese_sensory.txt', index_col=0, sep='\t')
Y_df

Product,Att 01,Att 02,Att 03,Att 04,Att 05,Att 06,Att 07,Att 08,Att 09,Att 10,Att 11,Att 12,Att 13,Att 14,Att 15,Att 16,Att 17
Pr 01,6.19,3.33,3.43,2.14,1.29,3.11,6.7,3.22,2.66,5.1,4.57,3.34,2.93,1.89,1.23,3.15,4.07
Pr 02,6.55,2.5,4.32,2.52,1.24,3.91,6.68,2.57,2.42,4.87,4.75,4.13,3.09,2.29,1.51,3.93,4.07
Pr 03,6.23,3.43,3.42,2.03,1.28,2.93,6.61,3.39,2.56,5.0,4.73,3.44,3.08,1.81,1.37,3.19,4.16
Pr 04,6.14,2.93,3.96,2.13,1.08,3.12,6.51,2.98,2.5,4.66,4.68,3.92,2.93,1.99,1.19,3.13,4.29
Pr 05,6.7,1.97,4.72,2.43,1.13,4.6,7.01,2.07,2.32,5.29,5.19,4.52,3.14,2.47,1.34,4.67,4.03
Pr 06,6.19,5.28,1.59,1.07,1.0,1.13,6.42,5.18,2.82,5.02,4.49,2.05,2.54,1.18,1.18,1.29,4.11
Pr 07,6.17,3.45,3.32,2.04,1.47,2.69,6.39,3.81,2.76,4.58,4.32,3.22,2.72,1.81,1.33,2.52,4.26
Pr 08,6.9,2.58,4.24,2.58,1.7,4.19,7.11,2.06,2.47,4.58,5.09,4.44,3.25,2.62,1.73,4.87,3.98
Pr 09,6.7,2.53,4.53,2.32,1.22,4.16,6.91,2.42,2.41,4.52,4.96,4.49,3.37,2.47,1.64,4.54,4.01
Pr 10,6.35,3.14,3.64,2.17,1.17,2.57,6.5,2.77,2.66,4.76,4.64,4.06,3.11,2.21,1.46,3.35,3.93


The ``RVcoeff`` and ``RV2coeff`` functions in hoggorm accept only **numpy** arrays with numerical values, not pandas data frames. Therefore, each of the two pandas data frames holding the imported data needs to be "taken apart" into three parts, which gives in total:
* two numpy arrays holding the numeric values
* two Python lists holding variable (column) names
* two Python lists holding object (row) names.

In [5]:
# Get the values from the data frame
X = X_df.values
Y = Y_df.values

# Get the variable or columns names
X_varNames = list(X_df.columns)
Y_varNames = list(Y_df.columns)

# Get the object or row names
X_objNames = list(X_df.index)
Y_objNames = list(Y_df.index)
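
Since the two arrays are compared row by row, it can be worth checking that the object names line up before going further. In the displayed data frames the fluorescence rows are labelled `Pr 1`, `Pr 2`, ... while the sensory rows are labelled `Pr 01`, `Pr 02`, ..., so a plain equality test on the name lists would fail even though the order is the same. The following is a minimal sketch in plain Python; `normalise_label` is a hypothetical helper, and the two lists are stand-ins written out here rather than read from the files.

```python
import re


def normalise_label(label):
    """Strip leading zeros from the number in a label, e.g. 'Pr 01' -> 'Pr 1'."""
    match = re.fullmatch(r"\s*(\D+?)\s*0*(\d+)\s*", label)
    if match is None:
        # Labels without a trailing number are only stripped of whitespace
        return label.strip()
    return f"{match.group(1)} {match.group(2)}"


# Stand-ins for X_objNames and Y_objNames (assumed label formats)
x_names = [f"Pr {i}" for i in range(1, 11)]
y_names = [f"Pr {i:02d}" for i in range(1, 11)]

# True only if both lists name the same objects in the same order
rows_match = [normalise_label(n) for n in x_names] == [
    normalise_label(n) for n in y_names
]
print(rows_match)
```

If `rows_match` is `False`, the rows of one array should be reordered before computing any matrix correlation coefficient.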

---

### Apply RV and RV2 to our data

Now, let's apply the RV and RV2 matrix correlation coefficient methods to the data (see the [description of the input parameters](https://hoggorm.readthedocs.io/en/latest/matrix_corr_coeff.html)). Each function takes a Python list containing two or more arrays measured on the same objects and computes the matrix correlation coefficient between every pair of arrays. The number and order of objects (rows) must match across the arrays, while the number of variables (columns) may differ. The RV coefficient takes values $0 \le RV \le 1$. The RV2 coefficient is a modified version of the RV coefficient with values $-1 \le RV2 \le 1$. RV2 is independent of object and variable size.
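
To make the two coefficients more concrete, here is a minimal numpy sketch based on their standard textbook definitions. It is not hoggorm's own implementation; the function names `rv_coefficient` and `rv2_coefficient` and the random test arrays are assumptions made for illustration. RV compares the object-by-object cross-product matrices $XX^T$ and $YY^T$, and RV2 does the same after setting their diagonals to zero.

```python
import numpy as np


def rv_coefficient(x, y):
    """RV coefficient between two column-centred arrays with matching rows."""
    sx = x @ x.T  # objects x objects cross-product matrix of x
    sy = y @ y.T
    return np.trace(sx @ sy) / np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))


def rv2_coefficient(x, y):
    """Modified RV: same formula, but with the diagonals of XX' and YY' removed."""
    sx = x @ x.T
    sy = y @ y.T
    np.fill_diagonal(sx, 0.0)
    np.fill_diagonal(sy, 0.0)
    return np.trace(sx @ sy) / np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))


# Synthetic arrays: 10 objects, different numbers of variables
rng = np.random.default_rng(0)
a = rng.normal(size=(10, 5))
b = rng.normal(size=(10, 3))
a = a - a.mean(axis=0)  # centre column-wise
b = b - b.mean(axis=0)

print(rv_coefficient(a, b), rv2_coefficient(a, b))
```

Both functions return 1 when an array is compared with itself and give the same value regardless of argument order.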

### Preprocessing the data

Arrays need to be preprocessed before computing RV and RV2. More precisely, the arrays need to be either centred or standardised/scaled.

In [9]:
# Center data first
X_cent = ho.center(X_df.values, axis=0)
Y_cent = ho.center(Y_df.values, axis=0)
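
As a rough illustration of what column-wise centring does, the following numpy sketch subtracts each column's mean from that column. It assumes that `ho.center(..., axis=0)` performs this same operation; the small array `data` is made up for the example.

```python
import numpy as np

# Toy array: 3 objects (rows), 2 variables (columns)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [6.0, 60.0]])

col_means = data.mean(axis=0)  # one mean per column
data_cent = data - col_means   # broadcasting subtracts it from every row

print(data_cent)
print(data_cent.mean(axis=0))  # each column mean is now zero
```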

In [8]:
X_cent

array([[-379.83342857, -380.60057143, -482.455     , ...,   -1.61318736,
          20.33922786,   10.2440525 ],
       [-635.99742857, -705.10057143, -816.955     , ...,  -71.55748736,
         -60.82737214,  -68.1726345 ],
       [  96.27857143,  119.84442857,  150.489     , ...,   61.77581264,
          47.22832786,   36.9940525 ],
       ...,
       [ 680.77857143,  698.06542857,  704.047     , ...,  -77.11308736,
         -85.88297514,  -67.6726345 ],
       [ -93.94242857, -193.98957143, -273.175     , ...,  293.27581264,
         279.00602786,  273.9940525 ],
       [-862.55142857, -874.15957143, -901.677     , ..., -126.27981936,
        -124.60514314, -125.1726345 ]])

In [10]:
Y_cent

array([[-1.70000000e-01,  1.68571429e-01, -2.17142857e-01,
         6.07142857e-02,  5.21428571e-02,  5.21428571e-02,
         2.92857143e-02,  1.97857143e-01,  1.25000000e-01,
         2.54285714e-01, -1.69285714e-01, -3.78571429e-01,
        -6.85714286e-02, -1.57142857e-01, -1.70000000e-01,
        -2.62857143e-01, -7.85714286e-03],
       [ 1.90000000e-01, -6.61428571e-01,  6.72857143e-01,
         4.40714286e-01,  2.14285714e-03,  8.52142857e-01,
         9.28571429e-03, -4.52142857e-01, -1.15000000e-01,
         2.42857143e-02,  1.07142857e-02,  4.11428571e-01,
         9.14285714e-02,  2.42857143e-01,  1.10000000e-01,
         5.17142857e-01, -7.85714286e-03],
       [-1.30000000e-01,  2.68571429e-01, -2.27142857e-01,
        -4.92857143e-02,  4.21428571e-02, -1.27857143e-01,
        -6.07142857e-02,  3.67857143e-01,  2.50000000e-02,
         1.54285714e-01, -9.28571429e-03, -2.78571429e-01,
         8.14285714e-02, -2.37142857e-01, -3.00000000e-02,
        -2.22857143e-01,  8.2

Once both arrays are centred, we store them in a list and pass that list to the RV or RV2 matrix correlation coefficient function, as shown below. Note that the list can contain two or more arrays. The function returns a square array holding the RV coefficients for all pairwise combinations of arrays.

In [19]:
rv_results_cent = ho.RVcoeff([X_cent, Y_cent])

In [23]:
rv_results_cent

array([[1.        , 0.24142324],
       [0.24142324, 1.        ]])

The RV computation results are stored in a new array, as seen above. On the diagonal the RV is 1, since we compute $RV(X_{cent}, X_{cent}) = 1$ and $RV(Y_{cent}, Y_{cent}) = 1$; each array is compared with itself, so the information in the two matrices is identical. Correspondingly, $RV(X_{cent}, Y_{cent}) = 0.24142324$ at index ``[0, 1]`` and $RV(Y_{cent}, X_{cent}) = 0.24142324$ at index ``[1, 0]``. The result array is symmetric because the RV coefficient does not depend on the order of the two arrays.
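
When the input list holds more than two arrays, it can be convenient to pull the unique pairwise values out of the result array. The sketch below does this for the values printed above. The array is typed in by hand rather than taken from `rv_results_cent`, and the labels in `names` are chosen here for illustration.

```python
import numpy as np

# RV results as printed above
rv = np.array([[1.0, 0.24142324],
               [0.24142324, 1.0]])
names = ["fluorescence", "sensory"]  # illustrative labels, one per input array

# Indices of the upper triangle without the diagonal: one entry per pair
rows, cols = np.triu_indices_from(rv, k=1)
pairs = {(names[i], names[j]): float(rv[i, j]) for i, j in zip(rows, cols)}

print(pairs)
```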

Now the corresponding computation using the RV2 coefficient.

In [24]:
rv2_results_cent = ho.RV2coeff([X_cent, Y_cent])

In [25]:
rv2_results_cent

array([[1.       , 0.1855865],
       [0.1855865, 1.       ]])

Next, repeat the computations with standardised arrays, in which each variable is scaled to unit variance so that all variables carry the same weight.

In [30]:
# Standardise data first
X_stand = ho.standardise(X_df.values, mode=0)
Y_stand = ho.standardise(Y_df.values, mode=0)
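
For reference, column-wise standardisation can be sketched in numpy as centring followed by division by each column's standard deviation. The helper `standardise_columns` is hypothetical, and using the sample standard deviation (`ddof=1`) is an assumption about what `ho.standardise` does. For RV and RV2 the choice of `ddof` does not matter, because it only rescales the whole array by one constant factor, which cancels in the ratio.

```python
import numpy as np


def standardise_columns(arr, ddof=1):
    """Centre each column and scale it to unit standard deviation."""
    centred = arr - arr.mean(axis=0)
    return centred / arr.std(axis=0, ddof=ddof)


# Synthetic array whose columns have very different scales
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

z1 = standardise_columns(data, ddof=1)
z0 = standardise_columns(data, ddof=0)

print(z1.std(axis=0, ddof=1))  # every column now has standard deviation 1
```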

In [26]:
rv_results_stand = ho.RVcoeff([X_stand, Y_stand])

In [27]:
rv_results_stand

array([[1.        , 0.53160759],
       [0.53160759, 1.        ]])

In [28]:
rv2_results_stand = ho.RV2coeff([X_stand, Y_stand])

In [29]:
rv2_results_stand

array([[1.        , 0.43897699],
       [0.43897699, 1.        ]])
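
The centred and standardised results differ because centring alone leaves each variable with its original variance, so variables with large numerical values dominate the coefficient. The sketch below illustrates this general effect with synthetic data, not the cheese data. The `rv` function follows the standard definition of the RV coefficient rather than hoggorm's code, and all names are chosen for the example.

```python
import numpy as np


def rv(x, y):
    """RV coefficient from the standard definition (not hoggorm's own code)."""
    sx, sy = x @ x.T, y @ y.T
    return np.trace(sx @ sy) / np.sqrt(np.trace(sx @ sx) * np.trace(sy @ sy))


def centre(a):
    return a - a.mean(axis=0)


def standardise(a):
    return centre(a) / a.std(axis=0, ddof=1)


rng = np.random.default_rng(1)
signal = rng.normal(size=10)
noise = rng.normal(size=10)

# x holds the signal plus a noise column on a much larger scale; y holds only the signal
x = np.column_stack([signal, 1000.0 * noise])
y = signal.reshape(-1, 1)

rv_cent = rv(centre(x), centre(y))             # dominated by the large noise column
rv_stand = rv(standardise(x), standardise(y))  # both columns of x weigh the same

print(rv_cent, rv_stand)
```

In this constructed case the standardised RV is higher because the shared signal is no longer drowned out by the large-scale column. Which preprocessing is appropriate for real data depends on whether differences in variable scale are meaningful.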