# PCA_Exp with multiple batches

This tutorial shows how to use the PCA_Exp package to load and analyse multiple batches of data. It also explains cumulative PC scores and error calculations.

Let's suppose that we have two batches of data from different datasets:

![title](figure/Figure_MB1.png)

![title](figure/Figure_MB2.png)

The first batch covers x values from 0 to 12, and the second from 0 to 16. The spacing between x values also differs: each measurement has 300 data points in the first batch and 600 in the second. The errors also vary differently in each batch. For PCA of the two batches to work, the x values have to be the same; if that is not possible, some approximation has to be made. Here, that means cutting and averaging the batch that has the wider range and more data points (batch 2 in this example).
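The cutting and averaging described above can be sketched with plain NumPy. This is only an illustration on synthetic data, not the package's own implementation: the function name `slice_and_bin` and the choice of bin edges halfway between neighbouring reference points are assumptions made for this example.

```python
import numpy as np

def slice_and_bin(x_fine, y_fine, x_ref):
    """Average y_fine into bins centred on x_ref; points outside the bins are dropped."""
    # Bin edges halfway between neighbouring reference points,
    # extended by half a step at both ends
    mid = 0.5 * (x_ref[1:] + x_ref[:-1])
    edges = np.concatenate(([2 * x_ref[0] - mid[0]], mid, [2 * x_ref[-1] - mid[-1]]))
    idx = np.digitize(x_fine, edges) - 1            # bin index of every fine point
    inside = (idx >= 0) & (idx < x_ref.size)        # this is the "cutting" step
    sums = np.bincount(idx[inside], weights=y_fine[inside], minlength=x_ref.size)
    counts = np.bincount(idx[inside], minlength=x_ref.size)
    # Empty bins become NaN instead of triggering a division warning
    return np.divide(sums, counts, out=np.full(x_ref.size, np.nan), where=counts > 0)

# Synthetic grids with the ranges and point counts quoted above
x_ref = np.linspace(0, 12, 300)      # grid like that of batch 1
x_fine = np.linspace(0, 16, 600)     # grid like that of batch 2
y_fine = 2 * x_fine + 1              # a simple test curve

y_binned = slice_and_bin(x_fine, y_fine, x_ref)
print(y_binned.shape)                # (300,)
```

After this step the finer curve lives on the same 300 x values as the reference, which is the condition the combined PCA needs.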

Let's start by importing relevant modules and loading batches:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from pca_exp.data_handler import DataHandler
from pca_exp.pca_machine import PCAMachine

In [None]:
dh = DataHandler()

loc1 = "./exp_data_example/multiple_batches/batch1/"
loc2 = "./exp_data_example/multiple_batches/batch2/"

stsp1 = (0,9)
stsp2 = (0,14)

prenum = "ede"
ext = ".txt"

dh.load_batch(stsp1, prenum, ext, loc1, name="batch1")
dh.load_batch(stsp2, prenum, ext, loc2, name="batch2")

The batches are loaded one after another and can be named. By default, the names are just numbers ("1", "2" and so on). Data can be sliced manually or with the built-in "slice_batch" function, and the averaging is done with the "bin_data" function:

In [None]:
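# Slicing the second batch (index 1) to the x range of the first batch
dh.slice_batch(batch_ind=1, x_vals=(0, 12.1))

# Averaging the second batch onto the x bins of the first batch
x_0 = dh.batches[0][:,0,0]                          # The reference x bins

dh.bin_data(x_0, batch_ind=[1])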

The reference "x_0" bins are taken from batch1 because that is the batch with fewer bins initially. If we plot the second batch after slicing and averaging:

In [None]:
fig1, axb2 = plt.subplots(1,1)
axb2.plot(dh.batches[1][:,0,0], dh.batches[1][:,:,1])
axb2.set_xlabel("x")
axb2.set_ylabel("y")
axb2.grid()
plt.show()

The x range now matches that of the first batch, and the curves look similar to those in the original figure of the second batch.
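Besides inspecting the plot, it can be useful to confirm numerically that two batches share one x grid before running PCA. The helper below is a hypothetical sketch (`same_x_grid` is not part of PCA_Exp) and uses synthetic arrays so that it runs on its own.

```python
import numpy as np

def same_x_grid(x_a, x_b, rtol=1e-9, atol=1e-12):
    """Return True if two x arrays have the same length and matching values."""
    x_a, x_b = np.asarray(x_a), np.asarray(x_b)
    return x_a.shape == x_b.shape and bool(np.allclose(x_a, x_b, rtol=rtol, atol=atol))

x1 = np.linspace(0, 12, 300)
print(same_x_grid(x1, x1.copy()))                 # True: identical grids
print(same_x_grid(x1, np.linspace(0, 16, 600)))   # False: different range and length
```

In this tutorial the arrays to compare would be `dh.batches[0][:,0,0]` and `dh.batches[1][:,0,0]`, following the indexing used in the cells above.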

Principal component analysis is performed in the same way as with a single batch:

In [None]:
dh.prepare_XYE_PCA(batch_ind=[0,1], preserve_batch_info=True)

pca_machine = PCAMachine(dh)

pca_machine.perform_pca()
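To make the idea of a combined PCA concrete, here is a minimal NumPy sketch that stacks two synthetic batches on a shared x grid and decomposes them with an SVD. It is an illustration under assumptions, not a description of what PCAMachine does internally: the synthetic curves, the centring on the mean curve and all variable names are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x = 300
x = np.linspace(0, 12, n_x)

# Synthetic stand-ins: each column is one measurement on the shared x grid
Y1 = np.sin(x)[:, None] * rng.uniform(0.5, 1.5, 10) + 0.01 * rng.normal(size=(n_x, 10))
Y2 = np.cos(x)[:, None] * rng.uniform(0.5, 1.5, 15) + 0.01 * rng.normal(size=(n_x, 15))

# Stack both batches side by side and subtract the mean curve
Y = np.hstack([Y1, Y2])                          # shape (300, 25)
Y_centred = Y - Y.mean(axis=1, keepdims=True)

# Columns of U are the principal components (functions of x)
U, s, Vt = np.linalg.svd(Y_centred, full_matrices=False)

# Row k holds the score of PC k for every measurement
scores = U.T @ Y_centred                         # shape (25, 25)
explained = s**2 / np.sum(s**2)                  # variance fraction per PC
print(scores.shape)                              # (25, 25)
```

Because both batches are decomposed together, their scores refer to the same set of principal components and can be compared directly.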

To retrieve the PCA scores of a specific batch, we can call the "return_scores_of_given_batch" method of the PCAMachine class. Batches can be identified by their names or by their indices (which correspond to the order in which the batches were loaded).

In [None]:
# Retrieve score by name
pc_scores_batch1 = pca_machine.return_scores_of_given_batch(batch_name="batch1")
# Retrieve score by index
pc_scores_batch2 = pca_machine.return_scores_of_given_batch(batch_ind=1)
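Retrieving scores by name or index amounts to remembering which columns of the combined score matrix belong to which batch. The sketch below shows one way to do that bookkeeping; the names `col_ranges` and `scores_of_batch` and the placeholder score matrix are hypothetical and do not reflect the package's internals.

```python
import numpy as np

# Placeholder combined score matrix: rows are PCs, columns are measurements,
# with the 10 columns of batch 1 followed by the 15 columns of batch 2
scores = np.arange(25 * 25, dtype=float).reshape(25, 25)
batch_names = ["batch1", "batch2"]          # in loading order
batch_sizes = [10, 15]

# Column range of each batch in the combined matrix
stops = np.cumsum(batch_sizes)
starts = stops - batch_sizes
col_ranges = {name: slice(int(a), int(b))
              for name, a, b in zip(batch_names, starts, stops)}

def scores_of_batch(scores, batch_name=None, batch_ind=None):
    """Return the score columns of one batch, chosen by name or by loading order."""
    if batch_name is None:
        batch_name = batch_names[batch_ind]
    return scores[:, col_ranges[batch_name]]

print(scores_of_batch(scores, batch_name="batch1").shape)   # (25, 10)
print(scores_of_batch(scores, batch_ind=1).shape)           # (25, 15)
```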

Next, we calculate the cumulative principal component scores (cPCS; see [1] for the definition) and the errors of the principal component scores [2]. By default, both are returned as dictionaries keyed by batch name.

In [None]:
cPCS = pca_machine.calculate_cPCS()
errors = pca_machine.calculate_pc_scores_error()
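As a rough picture of how measurement errors can carry over to PC scores, the sketch below applies standard first-order error propagation to a linear projection. It assumes independent errors at each x point and treats the principal components as exact; the method described in [2] and used by the package may differ. All arrays are synthetic and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_meas, n_pc = 300, 10, 3

# Hypothetical inputs: orthonormal PCs (columns of U) and the
# measurement error E[i, j] of point i in measurement j
U, _ = np.linalg.qr(rng.normal(size=(n_x, n_pc)))
E = rng.uniform(0.01, 0.05, size=(n_x, n_meas))

# score[k, j] = sum_i U[i, k] * y[i, j]
# => var(score[k, j]) = sum_i U[i, k]**2 * E[i, j]**2   (independent errors)
score_err = np.sqrt((U**2).T @ E**2)

print(score_err.shape)      # (3, 10): one error per PC and measurement
```

Since each component has unit norm, every propagated error is a weighted root mean square of the point errors and therefore lies between the smallest and largest value in `E`.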

In [None]:
# Cumulative principal component scores for both batches
fig2, axscpcs = plt.subplots(1,2)
pc_numbers = np.arange(1,11)
pc_labels = ['1st PC', '2nd PC', '3rd PC', '4th PC', '5th PC',
             '6th PC', '7th PC', '8th PC', '9th PC', '10th PC']
colours = ['r', 'b', 'g'] + 7 * ['grey']            # highlight the first three PCs

# Same bar plot for each batch, one panel per batch
for ax, name, title in zip(axscpcs, ["batch1", "batch2"],
                           ["cPCS - first batch", "cPCS - second batch"]):
    ax.bar(pc_numbers, cPCS[name][:10], color=colours)
    ax.set_xticks(pc_numbers)
    ax.set_xticklabels(pc_labels, rotation=45, ha='right', rotation_mode="anchor")
    ax.set_title(title)
plt.show()

In [None]:
# Principal component scores with errors
# One value per measurement: 10 in the first batch, 15 in the second
T1 = np.arange(0.1, 1.1, 0.1)
T2 = np.arange(0.1, 1.6, 0.1)

fig3, axserr = plt.subplots(3,2)

# Left column: first batch, right column: second batch
columns = [(T1, pc_scores_batch1, errors["batch1"]),
           (T2, pc_scores_batch2, errors["batch2"])]

for col, (T, scores, errs) in enumerate(columns):
    # One row per principal component (first three PCs)
    for row, colour in enumerate(['r', 'b', 'g']):
        axserr[row,col].scatter(T, scores[row,:], c=colour, marker='s',
                                edgecolors=colour, zorder=10)
        axserr[row,col].errorbar(T, scores[row,:], errs[row,:],
                                 ecolor='k', marker=None, capsize=4, ls="", zorder=3)

plt.show()