DEV AGARWAL 220968019 WEEK 8 PCA

### Required Imports

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.metrics import *
from sklearn.preprocessing import scale # Equivalent of StandardScaler()

# Dataset imports
from sklearn.datasets import load_breast_cancer
from keras.datasets import cifar10

### Tasks

### 1) Breast Cancer Data

- Obtain the dataset.

In [2]:
cancer_data: pd.DataFrame = load_breast_cancer(as_frame=True).frame
cancer_data

- View dataset metadata.

In [3]:
cancer_data.info()

- Preprocessing data:

*Replacing the numeric codes in the `target` column with their class names.*

In [4]:
# Approach from https://www.datacamp.com/tutorial/principal-component-analysis-in-python
# Note: scikit-learn encodes 0 as malignant and 1 as benign (see `target_names`).
cancer_data["target"] = cancer_data.target.replace({
    0: "Malignant",
    1: "Benign"
}).astype("category")
cancer_data
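
The integer-to-name mapping can also be read from the dataset itself instead of being typed by hand. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset, where `target_names` lists the class names in the order of their integer codes. The names `bunch`, `label_map` and `labels` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)

# target_names is ordered by integer code, so position i names class i.
label_map = {code: str(name).capitalize() for code, name in enumerate(bunch.target_names)}
print(label_map)

# Map the integer codes to names and store them as a categorical column.
labels = bunch.frame["target"].map(label_map).astype("category")
print(labels.value_counts())
```

Deriving the mapping this way avoids relying on a remembered encoding.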

- Defining feature vector and target variable:

*Feature vector:*

In [5]:
x: pd.DataFrame = cancer_data.drop(columns=["target"])
x

*Target variable:*

In [6]:
y: pd.Series = cancer_data.target
y

- Standardize the feature vector (zero mean and unit variance for each column).

In [7]:
x = pd.DataFrame(scale(x), index=x.index, columns=x.columns)
x
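
For reference, the standardization step can be written out by hand. This is a small sketch on synthetic data (the array `features` is made up for illustration), assuming the usual column-wise z-score with the population standard deviation, which is what `scale` is understood to compute.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: three features on very different scales.
features = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(200, 3))

# Column-wise z-score: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0).
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.mean(axis=0).round(6))
print(standardized.std(axis=0).round(6))
```

Without this step, features with large numeric ranges would dominate the principal components.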

- Derive principal components from the feature vector.

*Using 2 components, as in the tutorial.*

In [8]:
num_components: int = 2
pca_model: PCA = PCA(n_components=num_components)
x_new: pd.DataFrame = pd.DataFrame(
    pca_model.fit_transform(x),
    index=x.index,
    columns=[f"Principal Component {i}" for i in range(1, num_components+1)]
)
x_new
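
To make the derivation less of a black box, here is a rough sketch of PCA via the singular value decomposition of the centered data. It is an illustration on synthetic data, not scikit-learn's actual implementation; `pca_project` is a hypothetical helper, and the signs of its components may differ from those returned by `PCA`.

```python
import numpy as np

def pca_project(data, n_components):
    """Project data onto its first n_components principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value (i.e. decreasing variance along that direction).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 samples of 5 correlated features.
data = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

scores, components = pca_project(data, 2)
print(scores.shape)
```

The resulting columns of `scores` are uncorrelated with each other, and the first carries at least as much variance as the second.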

- Find the explained variance ratio, i.e. the fraction of the dataset's total variance captured by each principal component.

In [9]:
pd.DataFrame(
    pca_model.explained_variance_ratio_,
    index=x_new.columns,
    columns=["Explained Variance Ratio"]
)

*Information lost (fraction of the variance not captured by the retained components):*

In [10]:
1 - pca_model.explained_variance_ratio_.sum()
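
The relationship between the explained variance ratio and the information lost can be sketched directly from the singular values. The example below uses synthetic data and assumes the standard definition (variance along each component divided by total variance). The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: 4 features with decreasing spread.
data = rng.normal(size=(300, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
centered = data - data.mean(axis=0)

# Squared singular values are proportional to the variance along each component.
singular_values = np.linalg.svd(centered, compute_uv=False)
variances = singular_values**2 / (len(data) - 1)
ratio = variances / variances.sum()

kept = ratio[:2].sum()   # variance retained by the first 2 components
lost = 1 - kept          # variance discarded with the remaining components
print(ratio.round(4), round(lost, 4))
```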

- Plot the points obtained after feature extraction.

In [27]:
sns.scatterplot(
    x=x_new.iloc[:, 0],
    y=x_new.iloc[:, 1],
    hue=y
)

### 2) CIFAR-10 Data

- Obtain the dataset.

In [12]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

*Converting target arrays to pandas Series for convenience.*

In [13]:
y_train = pd.Series(y_train[:, 0], name="target")
y_test = pd.Series(y_test[:, 0], name="target")

*Replacing number values with respective categories in target variable.*

In [14]:
# As per https://www.datacamp.com/tutorial/principal-component-analysis-in-python
label_dict: dict[int, str] = {
 0: 'airplane',
 1: 'automobile',
 2: 'bird',
 3: 'cat',
 4: 'deer',
 5: 'dog',
 6: 'frog',
 7: 'horse',
 8: 'ship',
 9: 'truck',
}
y_train.replace(label_dict, inplace=True)
y_test.replace(label_dict, inplace=True)

- Visualize some data from the dataset.

In [15]:
plt.figure(figsize=(5, 15))
# Visualizing the first 10 training images with their labels
for i in range(10):
    plt.subplot(5, 2, i+1)
    plt.imshow(x_train[i])
    # A subplot title shows the label without needing an empty legend
    plt.title(f"Label: {y_train[i]}")

- Normalize feature vector.

*Using min-max normalization: every value is an 8-bit color intensity in the range 0-255, so dividing by 255 maps it to the range [0, 1].*

In [16]:
x_train = x_train / 255
x_test = x_test / 255
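
As a quick check of why dividing by 255 counts as min-max normalization here, the sketch below applies the general formula with the known channel range (0 to 255) to a tiny made-up array. `pixels`, `lo` and `hi` are illustrative names.

```python
import numpy as np

# Tiny stand-in for image data: 8-bit channel intensities.
pixels = np.array([[0, 64, 128], [191, 255, 32]], dtype=np.uint8)

# Known range of an 8-bit color channel (not the observed min/max of the data).
lo, hi = 0, 255

# General min-max formula; with lo = 0 it reduces to a division by 255.
scaled = (pixels.astype(np.float64) - lo) / (hi - lo)
print(scaled)
```

Using the fixed range rather than the observed minimum and maximum keeps the training and test sets on the same scale.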

- Check shape of the dataset.

In [17]:
x_train.shape, x_test.shape

*Each set is an array of images, where each image is a 32×32 grid of pixels with 3 color channels (RGB).*

- Flatten each image into a 1D row of 32×32×3 = 3072 values so that a dataframe can be constructed from the data.

In [18]:
x_train = x_train.reshape(-1, 32*32*3)
x_test = x_test.reshape(-1, 32*32*3)
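
The effect of the flattening can be seen on a small made-up array. This sketch assumes NumPy's default row-major ordering, under which the first three flattened values are the R, G and B channels of the top-left pixel, and shows that the operation is reversible.

```python
import numpy as np

# Two tiny 4x4 "images" with 3 channels, filled with consecutive integers.
images = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# One row per image; -1 lets NumPy infer 4 * 4 * 3 = 48 columns.
flat = images.reshape(len(images), -1)
print(flat.shape)

# Reshaping back recovers the original layout exactly.
restored = flat.reshape(-1, 4, 4, 3)
print(np.array_equal(restored, images))
```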

*Each column now holds one color channel value (R, G or B) of one pixel.*

- Construct dataframe for the data.

In [19]:
train_data: pd.DataFrame = pd.DataFrame(x_train, columns=[f"RGBVal{i}" for i in range(x_train.shape[1])])
train_data["target"] = y_train
train_data

In [20]:
test_data: pd.DataFrame = pd.DataFrame(x_test, columns=[f"RGBVal{i}" for i in range(x_test.shape[1])])
test_data["target"] = y_test
test_data

*Merging dataframes into a single one to perform PCA on:*

In [21]:
cifar10_data: pd.DataFrame = pd.concat([train_data, test_data], ignore_index=True)
cifar10_data

- Derive principal components from the feature vector.

*Declaring feature vector and target variable.*

In [22]:
x: pd.DataFrame = cifar10_data.drop(columns=["target"])
y: pd.Series = cifar10_data.target

*Performing PCA:*

In [23]:
num_components: int = 2
pca_model: PCA = PCA(n_components=num_components)
x_new: pd.DataFrame = pd.DataFrame(
    pca_model.fit_transform(x),
    index=x.index,
    columns=[f"Principal Component {i}" for i in range(1, num_components+1)]
)
x_new

- Find the explained variance ratio, i.e. the fraction of the dataset's total variance captured by each principal component.

In [24]:
pd.DataFrame(
    pca_model.explained_variance_ratio_,
    index=x_new.columns,
    columns=["Explained Variance Ratio"]
)

*Information lost (fraction of the variance not captured by the retained components):*

In [25]:
1 - pca_model.explained_variance_ratio_.sum()
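
When two components discard a large share of the variance, a natural follow-up is to ask how many components a given retention target would need. The sketch below estimates this on synthetic data with a cumulative sum of the explained variance ratio. The 90% target and all variable names are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in: 500 samples, 20 features with steadily shrinking spread.
data = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)
centered = data - data.mean(axis=0)

singular_values = np.linalg.svd(centered, compute_uv=False)
ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(ratio)

target = 0.90  # assumed retention target
# searchsorted gives the first index whose cumulative ratio reaches the target;
# adding 1 turns that index into a component count.
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, round(float(cumulative[n_needed - 1]), 4))
```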

- Plot the points obtained after feature extraction.

In [26]:
plt.figure(figsize=(16, 10))
sns.scatterplot(
    x=x_new.iloc[:, 0],
    y=x_new.iloc[:, 1],
    hue=y
)