After performing the Principal Component Analysis, we can move on to generating the data.

In [36]:
import numpy as np

from sklearn import decomposition

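# Generates new samples by perturbing the weaker principal components of a fitted PCA
def generate_data(pca, x, start):
    # Keep a copy so the fitted components can be restored afterwards
    original = pca.components_.copy()
    ncomp, nfeat = pca.components_.shape
    # Project the samples onto the principal components
    a = pca.transform(x)

    # Add Gaussian noise to every component from index `start` onward;
    # each component has one weight per feature, hence size=nfeat
    for i in range(start, ncomp):
        pca.components_[i,:] += np.random.normal(scale=0.1, size=nfeat)

    # Map the projections back to feature space using the perturbed components
    b = pca.inverse_transform(a)
    pca.components_ = original.copy()

    return b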
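
To see why perturbing the components produces new samples, the round trip through `transform` and `inverse_transform` can be sketched on stand-in data. The random array, the seed and the variable names below are assumptions made for illustration only; they are not the Iris features used in this notebook. When all components are kept, the round trip recovers the input almost exactly, and adding noise to the trailing components shifts the reconstruction slightly.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
# Stand-in data (assumption): 50 samples with 4 correlated features.
x = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

demo_pca = decomposition.PCA(n_components=4)
demo_pca.fit(x)

# With every component kept, transform followed by inverse_transform recovers x.
scores = demo_pca.transform(x)
roundtrip = demo_pca.inverse_transform(scores)
roundtrip_error = np.abs(roundtrip - x).max()

# Perturb the two weakest components, reconstruct, then restore the originals.
original = demo_pca.components_.copy()
demo_pca.components_[2:, :] += rng.normal(scale=0.1, size=(2, 4))
perturbed = demo_pca.inverse_transform(scores)
demo_pca.components_ = original
perturbed_error = np.abs(perturbed - x).max()

print(roundtrip_error, perturbed_error)
```

The second error is expected to be clearly larger than the first, which is the small deviation that `generate_data` relies on.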

Load the previously analysed features and labels NumPy arrays, then split them so that the first `120` records are used for training and the remaining records for testing.

In [37]:
feats = np.load("./features.npy")
label = np.load("./labels.npy")

TRAIN_RECORDS = 120

feats_train = feats[:TRAIN_RECORDS]
label_train = label[:TRAIN_RECORDS]

feats_test = feats[TRAIN_RECORDS:]
label_test = label[TRAIN_RECORDS:]
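
As a quick sanity check on the split, the slicing can be sketched on a placeholder array. The 150-row array below is an assumption made for illustration (it is not the contents of `features.npy`); the point is only that `[:TRAIN_RECORDS]` and `[TRAIN_RECORDS:]` partition the rows without overlap.

```python
import numpy as np

TRAIN_RECORDS = 120

# Placeholder data (assumption): 150 records with 4 features each.
demo_feats = np.arange(150 * 4, dtype=float).reshape(150, 4)

demo_train = demo_feats[:TRAIN_RECORDS]
demo_test = demo_feats[TRAIN_RECORDS:]

# Every record lands in exactly one of the two slices.
assert demo_train.shape[0] + demo_test.shape[0] == demo_feats.shape[0]
print(demo_train.shape, demo_test.shape)  # (120, 4) (30, 4)
```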

With the training records in place, Principal Component Analysis (PCA) from scikit-learn is used to analyse them.
The number of components to keep is set to 4 because this dataset has 4 features: sepal length, sepal width, petal length and petal width.

The `shape` attribute of an `ndarray` returns the length of each of its dimensions. The training features array has a shape of `(120, 4)`, so it is a 2-dimensional array where the first dimension holds the rows (one per sample) and the second dimension holds the 4 features that make up each sample.

In [38]:
FEATURES = feats_train.shape[1]
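
A minimal sketch of how `shape` reads, using a zero-filled placeholder array (an assumption for illustration) rather than the real features:

```python
import numpy as np

# Placeholder with the same layout as the training features: 120 rows, 4 columns.
placeholder = np.zeros((120, 4))

n_samples, n_features = placeholder.shape
print(placeholder.ndim)         # 2
print(n_samples, n_features)    # 120 4
```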

Using the number of features that make up every sample, we can use the PCA algorithm to find the variance ratio explained by each principal component.
First a PCA instance must be created:

In [39]:
pca = decomposition.PCA(n_components=FEATURES)

Then the `feats` array is used to fit the PCA instance. Note that this is the full features array, not only the `(120, 4)` training slice `feats_train`:

In [40]:
pca.fit(feats)

Finally, the variance ratio can be inspected for each component:

In [41]:
VARIANCE_RATIO = pca.explained_variance_ratio_
print(VARIANCE_RATIO)

[0.92395437 0.05343362 0.01737228 0.00523974]
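
For intuition, the explained variance ratio can be sketched by hand: with all components kept, it corresponds to the eigenvalues of the covariance matrix of the data, sorted from largest to smallest and divided by their sum. The synthetic array and the names below are assumptions for illustration, so the printed numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): 120 samples, 4 correlated features.
data = rng.normal(size=(120, 4)) @ rng.normal(size=(4, 4))

# Eigenvalues of the covariance matrix are the variances along each principal axis.
# eigvalsh returns them in ascending order, so reverse to get largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# Dividing by the total variance gives the fraction explained by each component.
ratio = eigenvalues / eigenvalues.sum()
print(ratio)
```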


Each of these elements is the fraction of the total variance explained by one principal component, ordered from largest to smallest. The components are linear combinations of the four original features, so the ratios do not map one-to-one onto the features themselves:

1. sepal length
2. sepal width
3. petal length
4. petal width

In [42]:
print(f"First two components: {VARIANCE_RATIO[0]} + {VARIANCE_RATIO[1]} = {VARIANCE_RATIO[0] + VARIANCE_RATIO[1]}")
print(f"Last two components: {VARIANCE_RATIO[2]} + {VARIANCE_RATIO[3]} = {VARIANCE_RATIO[2] + VARIANCE_RATIO[3]}")
print(f"Sum: {VARIANCE_RATIO[0] + VARIANCE_RATIO[1] + VARIANCE_RATIO[2] + VARIANCE_RATIO[3]}")

First two components: 0.9239543681440451 + 0.053433619322753645 = 0.9773879874667988
Last two components: 0.017372275759703716 + 0.005239736773497558 = 0.022612012533201276
Sum: 1.0


> The scikit-learn library provides an example of PCA on this dataset: https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html
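
As a rough companion to that example (this is not the code from the linked page), the copy of Iris bundled with scikit-learn can be loaded through `datasets.load_iris` and fitted directly. It uses the full dataset rather than the `.npy` files loaded earlier, so the ratios may differ slightly from the output above. The names `iris` and `iris_pca` are chosen here for illustration.

```python
from sklearn import datasets, decomposition

# Bundled copy of the Iris dataset (all samples, 4 features).
iris = datasets.load_iris()

iris_pca = decomposition.PCA(n_components=4)
iris_pca.fit(iris.data)

print(iris.feature_names)
print(iris_pca.explained_variance_ratio_)
# Each row is one principal component, expressed as weights over the 4 features.
print(iris_pca.components_)
```

Every row of `components_` has a weight for each of the four features, which shows that a component is a mixture of features and not a single one.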

In [43]:
# ≅97.7% of the variance is explained by the first two principal components,
# so only the components from index START onward are perturbed.
START = 2
nsets = 10
nsamp = feats_train.shape[0]
new_feats = np.zeros((nsets*nsamp, feats_train.shape[1]))
new_label = np.zeros(nsets*nsamp, dtype="uint8")

for i in range(nsets):
    if i == 0:
        new_feats[0:nsamp,:] = feats_train
        new_label[0:nsamp] = label_train
    else:
        new_feats[(i*nsamp):(i*nsamp+nsamp),:] = generate_data(pca, feats_train, START)
        new_label[(i*nsamp):(i*nsamp+nsamp)] = label_train

idx = np.argsort(np.random.random(nsets*nsamp))
new_feats = new_feats[idx]
new_label = new_label[idx]
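
The final shuffle works because the same index array is applied to both the features and the labels, so every row keeps its label. A small sketch on a toy array (the array and the names are assumptions for illustration) makes the pairing checkable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 6 samples, 2 features.
demo_feats = np.arange(12, dtype=float).reshape(6, 2)
# Label each row with its first feature value so the pairing can be verified.
demo_label = demo_feats[:, 0].astype("uint8")

# Sorting random numbers yields a random ordering of the row indices.
idx = np.argsort(rng.random(demo_feats.shape[0]))
shuffled_feats = demo_feats[idx]
shuffled_label = demo_label[idx]

# Rows and labels are still aligned after shuffling.
assert np.array_equal(shuffled_feats[:, 0].astype("uint8"), shuffled_label)
```

`np.random.permutation(n)` would produce such an index directly and could be used in place of the `argsort` call.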