<a href="https://colab.research.google.com/github/ICRAR/PHYS5511/blob/master/2019/week07/PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Original Code
https://www.datacamp.com/community/tutorials/principal-component-analysis-in-python

# Breast Cancer Data

The Breast Cancer data set is a real-valued multivariate dataset that consists of two classes, where the class signifies whether a patient has breast cancer or not. The two categories are malignant and benign.

The malignant class has 212 samples, whereas the benign class has 357 samples.

It has 30 features shared across all classes: radius, texture, perimeter, area, smoothness, fractal dimension, etc.

Let's first explore the Breast Cancer dataset.

You will use sklearn's module datasets and import the Breast Cancer dataset from it.



In [0]:
from sklearn.datasets import load_breast_cancer

`load_breast_cancer` will give you both the labels and the data. To fetch the data, you will use `.data`, and to fetch the labels, `.target`.

The data has 569 samples with thirty features, and each sample has a label associated with it. There are two labels in this dataset.

In [0]:
breast = load_breast_cancer()


In [0]:
breast_data = breast.data

Let's check the shape of the data.



In [0]:
breast_data.shape

You do not need the labels for this tutorial, but for better understanding, let's load them and check their shape.



In [0]:
breast_labels = breast.target
breast_labels.shape
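
As a quick sanity check on the labels, the sketch below counts the samples per class and shows which integer scikit-learn assigns to each class name. It reloads the dataset so that it runs on its own; `class_names` and `class_counts` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

# target_names[i] is the class name for integer label i
class_names = [str(name) for name in breast.target_names]
# bincount returns how many samples carry label 0, label 1, ...
class_counts = np.bincount(breast.target)

for label, (name, count) in enumerate(zip(class_names, class_counts)):
    print(label, name, count)
```

With the copy of the data bundled with scikit-learn, label `0` is malignant (212 samples) and label `1` is benign (357 samples), matching the counts quoted at the start.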

Now you will `import numpy`, since you will reshape `breast_labels` and concatenate it with `breast_data` so that you can finally create a `DataFrame` that holds both the data and the labels.



In [0]:
import numpy as np
labels = np.reshape(breast_labels,(569,1))

After `reshaping` the labels, you will `concatenate` the data and labels along the second axis, which means the final shape of the array will be 569 x 31.



In [0]:
final_breast_data = np.concatenate([breast_data,labels],axis=1)
final_breast_data.shape
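
The reshape-then-concatenate pattern can be tried on a toy array first. This is a minimal sketch with made-up data (`toy_data`, `toy_labels`); it uses `reshape(-1, 1)` so that the number of samples is inferred rather than hard-coded.

```python
import numpy as np

# Toy stand-ins for the feature matrix and the label vector
toy_data = np.arange(12, dtype=float).reshape(4, 3)   # 4 samples, 3 features
toy_labels = np.array([0, 1, 1, 0])                   # shape (4,)

# -1 asks NumPy to infer the number of rows, so no sample count is hard-coded
toy_labels_col = toy_labels.reshape(-1, 1)            # shape (4, 1)

# axis=1 appends the labels as one extra column
toy_combined = np.concatenate([toy_data, toy_labels_col], axis=1)
print(toy_combined.shape)
```

Both arrays must have the same number of rows for the concatenation along `axis=1` to succeed, which is why the one-dimensional label vector is turned into a column first.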

Now you will import `pandas` to create the `DataFrame` of the final data to represent the data in a tabular fashion.



In [0]:
import pandas as pd
breast_dataset = pd.DataFrame(final_breast_data)


Let's quickly print the features that are there in the breast cancer dataset!



In [0]:
features = breast.feature_names
features



Note that the `label` field is missing from the `features` array. Hence, you will have to add it manually, since you will use this array as the column names of your `breast_dataset` dataframe.



In [0]:
features_labels = np.append(features,'label')


Now you will embed the column names to the `breast_dataset` dataframe.



In [0]:
breast_dataset.columns = features_labels


Let's print the first few rows of the dataframe.



In [0]:
breast_dataset.head()
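
For comparison, the same table can probably be built in fewer steps by handing the column names to the `DataFrame` constructor. This is only a sketch of an alternative route, not the approach this notebook follows; `compact_df` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

# Pass the feature names as column names when constructing the frame
compact_df = pd.DataFrame(breast.data, columns=breast.feature_names)
# Add the integer labels as one more column
compact_df['label'] = breast.target

print(compact_df.shape)
```

One difference to be aware of: here the `label` column keeps its integer type, whereas concatenating the labels with the float features, as done above, stores them as floats.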

Since the original labels are in `0,1` format, you will change them to `Malignant` and `Benign` using the `.replace` function. In scikit-learn's encoding, `0` is malignant and `1` is benign (see `breast.target_names`). The result is assigned back to the `label` column of `breast_dataset`.



In [0]:
# 0 -> malignant, 1 -> benign, following breast.target_names
breast_dataset['label'] = breast_dataset['label'].replace({0: 'Malignant', 1: 'Benign'})

Let's print the last few rows of the `breast_dataset`.

In [0]:
breast_dataset.tail()

## Visualizing the Breast Cancer data
You start by standardising the data, since PCA's output is influenced by the scale of the features.

It is a common practice to normalize your data before feeding it to any machine learning algorithm.

To apply normalization, you will import the `StandardScaler` class from the sklearn library and select only the features from the `breast_dataset` you created in the Data Exploration step. Once you have the features, you will then apply scaling by calling `fit_transform` on the feature data.

`StandardScaler` rescales each feature independently so that it has a mean of zero and a standard deviation of one. It shifts and scales the distribution of a feature but does not change its shape, so the features do not need to be normally distributed.

In [0]:
from sklearn.preprocessing import StandardScaler
x = breast_dataset.loc[:, features].values
x = StandardScaler().fit_transform(x) # normalising the features
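
To make the scaling step concrete, the sketch below standardises the raw features by hand and compares the result with `StandardScaler`. It reloads the data so that it runs on its own; `raw`, `manual` and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

raw = load_breast_cancer().data

# z = (x - mean) / std, computed per feature (column);
# np.std defaults to the population standard deviation (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

scaled = StandardScaler().fit_transform(raw)

# Compare the two results, then check the per-feature mean and std
print(np.allclose(manual, scaled))
print(np.allclose(scaled.mean(axis=0), 0.0, atol=1e-6),
      np.allclose(scaled.std(axis=0), 1.0, atol=1e-6))
```

Checking the mean and standard deviation per column (`axis=0`) is a stricter test than checking them over the whole array, because it confirms that every individual feature was rescaled.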

In [0]:
x.shape

Let's check whether the normalised data has a mean of zero and a standard deviation of one.

In [0]:
np.mean(x), np.std(x)

Let's convert the normalised features into a tabular format with the help of DataFrame.

In [0]:
feat_cols = ['feature'+str(i) for i in range(x.shape[1])]
normalised_breast = pd.DataFrame(x,columns=feat_cols)
normalised_breast.tail()

Now comes the critical part, the next few lines of code will be projecting the thirty-dimensional Breast Cancer data to two-dimensional **principal components**.

You will import the `PCA` class from the sklearn library, pass the number of components (`n_components=2`) to its constructor, and finally call `fit_transform` on the standardised data. Here, the number of components is the dimension of the lower-dimensional space onto which you will project your higher-dimensional data.

In [0]:
from sklearn.decomposition import PCA
pca_breast = PCA(n_components=2)
principalComponents_breast = pca_breast.fit_transform(x)
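
To see roughly what `fit_transform` computes, the sketch below reproduces a two-component projection with a plain singular value decomposition on synthetic data. It is an illustration under the assumption of default `PCA` settings (no whitening), not a description of scikit-learn's internal code path; all variable names are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 correlated features
demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Centre each feature
centred = demo - demo.mean(axis=0)
# 2. SVD of the centred data; the rows of vt are the principal directions
u, s, vt = np.linalg.svd(centred, full_matrices=False)
# 3. Project onto the first two directions
manual_scores = centred @ vt[:2].T

sk_scores = PCA(n_components=2).fit_transform(demo)

# The sign of each component is arbitrary, so compare absolute values
print(np.allclose(np.abs(manual_scores), np.abs(sk_scores), atol=1e-6))
```

The comparison uses absolute values because a principal direction and its negative describe the same axis, so two implementations may legitimately disagree on the sign.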

Next, let's create a DataFrame that will have the principal component values for all 569 samples.

In [0]:
principal_breast_Df = pd.DataFrame(data = principalComponents_breast, columns = ['principal component 1', 'principal component 2'])
principal_breast_Df.tail()

Once you have the principal components, you can inspect `explained_variance_ratio_`. It gives the fraction of the total variance (information) that each principal component holds after projecting the data to a lower-dimensional subspace.

In [0]:
print(f'Explained variation per principal component: {pca_breast.explained_variance_ratio_}')

From the above output, you can observe that *principal component 1* holds 44.2% of the information while *principal component 2* holds only 19%. Also note that projecting the thirty-dimensional data to two dimensions lost the remaining 36.8% of the information.
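
The retained and lost fractions can also be computed rather than read off by eye. The sketch below refits a two-component PCA on the standardised features so that it runs on its own; `features_std`, `pca_two`, `retained` and `lost` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(load_breast_cancer().data)
pca_two = PCA(n_components=2).fit(features_std)

ratios = pca_two.explained_variance_ratio_
retained = ratios.sum()     # fraction of variance kept by the two components
lost = 1.0 - retained       # fraction discarded by the projection

print(f'per component: {np.round(ratios, 3)}')
print(f'retained: {retained:.3f}, lost: {lost:.3f}')
```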

Let's plot the 569 samples along the *principal component - 1* and *principal component - 2* axes. It should give you good insight into how your samples are distributed among the two classes.

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of Breast Cancer Dataset",fontsize=20)
targets = ['Benign', 'Malignant']
colors = ['r', 'g']
for target, color in zip(targets,colors):
    indicesToKeep = breast_dataset['label'] == target
    plt.scatter(principal_breast_Df.loc[indicesToKeep, 'principal component 1']
               , principal_breast_Df.loc[indicesToKeep, 'principal component 2'], c = color, s = 50)

plt.legend(targets,prop={'size': 15})

From the above graph, you can observe that the two classes, **benign** and **malignant**, are linearly separable to some extent when projected to a two-dimensional space. Another observation is that the **malignant** class (label `0` in scikit-learn's encoding) is more spread out than the **benign** class.

# CIFAR - 10
The CIFAR-10 (Canadian Institute For Advanced Research) dataset consists of 60000 color images of size 32x32x3 in ten classes, with 6000 images per class.

The dataset consists of 50000 training images and 10000 test images.

The classes in the dataset are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.

You can load the CIFAR - 10 dataset using Keras.

In [0]:
from keras.datasets import cifar10


Once imported, you will use the `.load_data()` method to download the data. It will download the data and store it in your Keras directory.

In [0]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()


The above line of code returns training and test images along with the labels.



Let's quickly print the shapes of the training and testing images.



In [0]:
print('Training data shape:', x_train.shape)
print('Testing data shape:', x_test.shape)

Let's also print the shape of the labels.



In [0]:
y_train.shape,y_test.shape


Let's also find out the total number of labels and the various kinds of classes the data has.

In [0]:
# Find the unique numbers from the train labels
classes = np.unique(y_train)
nClasses = len(classes)
print('Total number of outputs : ', nClasses)
print('Output classes : ', classes)

For a better understanding, let's create a dictionary that will have class names with their corresponding categorical class labels.

In [0]:
label_dict = {
 0: 'airplane',
 1: 'automobile',
 2: 'bird',
 3: 'cat',
 4: 'deer',
 5: 'dog',
 6: 'frog',
 7: 'horse',
 8: 'ship',
 9: 'truck',
}

In [0]:
plt.figure(figsize=[5,5])

# Display the first image in training data
plt.subplot(121)
curr_img = np.reshape(x_train[0], (32,32,3))
plt.imshow(curr_img)
# y_train has shape (N, 1), so [0][0] picks the integer label of the first image
plt.title("(Label: " + str(label_dict[y_train[0][0]]) + ")")

# Display the first image in testing data
plt.subplot(122)
curr_img = np.reshape(x_test[0],(32,32,3))
plt.imshow(curr_img)
plt.title("(Label: " + str(label_dict[y_test[0][0]]) + ")")

Even though the above two images are blurry, you can still observe that the first image is a *frog* with the label *frog*, while the second image is a *cat* with the label *cat*.

The following lines of code for visualizing the CIFAR-10 data are pretty similar to the PCA visualization of the Breast Cancer data.

Let's quickly check the maximum and minimum values of the CIFAR-10 training images and normalize the pixels between 0 and 1 inclusive.

In [0]:
np.min(x_train), np.max(x_train)

In [0]:
x_train = x_train/255.0
np.min(x_train),np.max(x_train)
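
The effect of dividing by 255 can be checked on a tiny made-up batch without downloading anything. In this sketch `toy_images` is a hypothetical stand-in for `x_train`, assumed to hold 8-bit pixel values like the CIFAR-10 arrays.

```python
import numpy as np

# Toy stand-in for a batch of 8-bit images: 2 images of 4x4 pixels, 3 channels
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)
toy_images[0, 0, 0, 0] = 0      # force both extremes to be present
toy_images[0, 0, 0, 1] = 255

# Dividing by a float promotes uint8 to float64 and maps [0, 255] to [0, 1]
toy_scaled = toy_images / 255.0

print(toy_images.dtype, toy_scaled.dtype)
print(toy_scaled.min(), toy_scaled.max())
```

Because the result is stored as 64-bit floats, the scaled array takes eight times the memory of the original 8-bit array, which is worth keeping in mind for the full 50,000-image training set.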

In [0]:
x_train.shape

Next, you will create a DataFrame that will hold the pixel values of the images along with their respective labels in a row-column format.

But before that, let's reshape the image dimensions from three to one (flatten the images).

In [0]:
x_train_flat = x_train.reshape(-1, 32 * 32 * 3)
feat_cols = ['pixel'+str(i) for i in range(x_train_flat.shape[1])]
df_cifar = pd.DataFrame(x_train_flat,columns=feat_cols)
df_cifar['label'] = y_train
print('Size of the dataframe: {}'.format(df_cifar.shape))
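
Flattening only rearranges the values; nothing is lost. The sketch below demonstrates this on a small random batch (`toy_batch` is a made-up stand-in for `x_train`) and shows how a column index in the flat array maps back to a row, column and channel.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_batch = rng.random((5, 32, 32, 3))          # 5 fake images

# Flatten: each image becomes one row of 32 * 32 * 3 = 3072 values
toy_flat = toy_batch.reshape(-1, 32 * 32 * 3)
# Reshaping back recovers the original images exactly
toy_restored = toy_flat.reshape(-1, 32, 32, 3)

# With NumPy's default row-major order, the value at (row, col, channel)
# ends up in flat column (row * 32 + col) * 3 + channel
row, col, channel = 2, 3, 1
flat_index = (row * 32 + col) * 3 + channel

print(toy_flat.shape)
print(np.array_equal(toy_batch, toy_restored))
print(toy_flat[0, flat_index] == toy_batch[0, row, col, channel])
```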

The size of the dataframe is correct: there are 50,000 training images, each with 3072 pixel values (32 x 32 pixels x 3 color channels), plus one additional column for the labels, so 3073 columns in total.

PCA will be applied on all the columns except the last one, which is the label for each image.

In [0]:
df_cifar.head()

Next, you will create the PCA instance with two components and apply `fit_transform` on the training data. This can take a few seconds since there are 50,000 samples.

In [0]:
pca_cifar = PCA(n_components=2)
principalComponents_cifar = pca_cifar.fit_transform(df_cifar.iloc[:,:-1])

Then you will convert the principal components for each of the 50,000 images from a numpy array to a pandas DataFrame.

In [0]:
principal_cifar_Df = pd.DataFrame(data = principalComponents_cifar, columns = ['principal component 1', 'principal component 2'])
principal_cifar_Df['y'] = y_train
principal_cifar_Df.head()

Let's quickly find out the amount of information or variance the principal components hold.

In [0]:
print(f'Explained variation per principal component: {pca_cifar.explained_variance_ratio_}')

Well, it looks like a decent amount of information was retained by the principal components 1 and 2, given that the data was projected from 3072 dimensions to a mere two principal components.

It's time to visualize the CIFAR-10 data in a two-dimensional space. Remember that there is some semantic class overlap in this dataset, which means that a frog can look slightly similar to a cat, or a deer to a dog, especially when projected to a two-dimensional space. The differences between them might not be captured that well.

In [0]:
import seaborn as sns
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="principal component 1", y="principal component 2",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=principal_cifar_Df,
    legend="full",
    alpha=0.3
)

From the above figure, you can observe that some variation was captured by the principal components, since there is some structure in the points when projected along the two principal component axes. Points belonging to the same class tend to lie close to each other, while images that are very different semantically lie further apart.

# Speed Up Deep Learning Training using PCA with CIFAR - 10 Dataset

Now let's speed up your Deep Learning Model's training process using PCA.

First, let's normalize the training and testing images. If you remember the training images were normalized in the PCA visualization part, so you only need to normalize the testing images.

In [0]:
x_test = x_test/255.0
x_test = x_test.reshape(-1,32,32,3)

Let's reshape the test data.

In [0]:
x_test_flat = x_test.reshape(-1,3072)

Next, you will make the instance of the PCA model.

Here, you can also pass how much variance you want PCA to capture. Let's pass 0.9 as a parameter to the PCA model, which means that PCA will hold 90% of the variance and the number of components required to capture 90% variance will be used.

Note that earlier you passed `n_components` as an integer and could then find out how much variance was captured by those two components. Here you instead state how much variance you would like PCA to capture, and hence the number of components will vary based on that variance parameter.

If you do not pass `n_components` at all, PCA keeps all components, that is, `min(n_samples, n_features)` of them, which here equals the original dimension of the data.

In [0]:
pca = PCA(0.9)
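
The rule behind a fractional `n_components` can be sketched by accumulating the explained variance ratios and counting how many components are needed to pass the threshold. To keep the example fast, it uses the standardised Breast Cancer features as a small stand-in for CIFAR-10; `manual_k` and `auto_k` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features_std = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit with all components and accumulate the explained variance ratios
full_pca = PCA().fit(features_std)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.9
manual_k = int(np.searchsorted(cumulative, 0.9, side='right')) + 1

auto_k = PCA(0.9).fit(features_std).n_components_
print(manual_k, auto_k)
```

The two numbers should agree, which shows that `PCA(0.9)` is a shortcut for fitting all components and then truncating at the 90% mark.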

Then you will fit the PCA instance on the training images.

In [0]:
pca.fit(x_train_flat)

Now let's find out how many n_components PCA used to capture 0.9 variance.

In [0]:
pca.n_components_

From the above output, you can observe that to achieve 90% variance, the dimension was reduced to 99 principal components from the actual 3072 dimensions.

Finally, you will apply transform on both the training and test set to generate a transformed dataset from the parameters generated from the fit method.

In [0]:
train_img_pca = pca.transform(x_train_flat)
test_img_pca = pca.transform(x_test_flat)
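
What `transform` does with the fitted parameters can be written out in one line: subtract the mean learned from the training set, then project onto the learned components. The sketch below checks this on synthetic data, assuming the default `whiten=False`; `toy_train` and `toy_test` are made-up arrays.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mix = rng.normal(size=(6, 6))
toy_train = rng.normal(size=(300, 6)) @ mix
toy_test = rng.normal(size=(50, 6)) @ mix

# The mean and the components are estimated from the training set only
toy_pca = PCA(n_components=3).fit(toy_train)

# transform subtracts the training mean and projects onto the learned components
manual_test = (toy_test - toy_pca.mean_) @ toy_pca.components_.T
sk_test = toy_pca.transform(toy_test)

print(sk_test.shape)
print(np.allclose(manual_test, sk_test))
```

Fitting on the training images only and reusing those parameters for the test images keeps information from the test set out of the preprocessing step.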

Next, let's quickly import the necessary libraries to run the deep learning model.

In [0]:
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.optimizers import RMSprop

Now, you will convert your training and testing labels to one-hot encoded vectors.

In [0]:
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
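
One-hot encoding itself is simple enough to reproduce with NumPy, which may help to see what the Keras helper returns. This is an illustrative equivalent on made-up labels, not the Keras implementation; `toy_labels` and `n_classes_demo` are hypothetical names.

```python
import numpy as np

toy_labels = np.array([[6], [9], [0], [3]])     # shape (4, 1), like the CIFAR-10 labels
n_classes_demo = 10

# Row i of the identity matrix is the one-hot vector for class i
toy_one_hot = np.eye(n_classes_demo)[toy_labels.ravel()]

print(toy_one_hot.shape)
print(toy_one_hot[0])
```

Each row contains a single 1 at the position of its class, so taking `argmax` along the last axis recovers the original integer labels.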

Let's define the batch size, the number of classes, and the number of epochs, and then build the model.

In [0]:
batch_size = 128
num_classes = 10
epochs = 20
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(99,)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

Let's print the model summary.

In [0]:
model.summary()
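
The parameter total reported by the summary can be verified by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper `dense_param_count` below is a hypothetical function written for this check; it also counts the same architecture with 3072 inputs, which is the size used when training on the raw pixels.

```python
def dense_param_count(layer_sizes):
    """Total weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total

pca_params = dense_param_count([99, 1024, 1024, 512, 256, 10])
raw_params = dense_param_count([3072, 1024, 1024, 512, 256, 10])
print(pca_params, raw_params)
```

The 99-input network has 1,810,698 parameters against 4,855,050 for the 3072-input one. The whole difference sits in the first layer, which is one reason the PCA-based model trains faster per epoch.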

Finally, it's time to compile and train the model!

In [0]:
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])

history = model.fit(
    train_img_pca, 
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=(test_img_pca, y_test))

From the above output, you can observe that training each epoch took just 20 seconds on a CPU. The model did a decent job on the training data, reaching roughly 90% accuracy, but it reached only about 55% accuracy on the test data (don't forget that each of your runs will differ from mine). This means that it overfitted the training data. However, remember that the data was projected from 3072 dimensions down to 99, and despite that it did a great job!

Finally, let's see how much time the model takes to train on the original dataset and how much accuracy it can achieve using the same deep learning model.

In [0]:
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(3072,)))
model.add(Dense(1024, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(),
    metrics=['accuracy'])

history = model.fit(
    x_train_flat, 
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=(x_test_flat, y_test))

From the above output, it is quite evident that training each epoch took around 45 seconds on a CPU, roughly twice as long as for the model trained on the PCA output.

Moreover, both the training and testing accuracy are lower than what you achieved with the 99 principal components as input to the model.

So, by applying PCA to the training data you were able to train your deep learning model not only faster, but also to a better accuracy on the testing data than the same model trained on the original training data.