# **Machine Learning in Python: Performing Principal Component Analysis (PCA) on Breast Cancer Wisconsin Diagnostic dataset**

In this Jupyter notebook, we perform Principal Component Analysis (PCA) on the Breast Cancer Wisconsin Diagnostic data set.

---

## **1. Breast Cancer Wisconsin Diagnostic data set**

### Load library

In [1]:
from sklearn import datasets

### Load dataset

In [2]:
bc = datasets.load_breast_cancer()

In [14]:
bc

 'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'feature_names': array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
        'mean smoothness', 'mean compactness', 'mean concavity',
        'mean concave points', 'mean symmetry', 'mean fractal dimension',
        'radius error', 'texture error', 'perimeter error', 'area error',
        'smoothness error', 'compactness error', 'concavity error',
        'concave points error', 'symmetry error',
        'fractal di

### Input features

In [3]:
print(bc.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


### Output classes

In [4]:
print(bc.target_names)

['malignant' 'benign']


### Assigning Input (X) and Output (Y) variables
Let's assign the 30 input variables to X and the output variable (class label) to Y.

In [5]:
X = bc.data
Y = bc.target

### Examine the data dimensions

In [6]:
X.shape

(569, 30)

In [7]:
Y.shape

(569,)

---

## **2. PCA**

### 2.1. Load library

In [8]:
from sklearn.preprocessing import scale  # Standardizes each feature
from sklearn import decomposition  # Provides the PCA class
import pandas as pd  # DataFrames for scores and loadings

### 2.2. Data scaling

In [9]:
X = scale(X)
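
`scale` standardizes each column to zero mean and unit variance. This matters here because the 30 features sit on very different scales (areas in the hundreds, smoothness values well below 1), and PCA would otherwise be dominated by the largest ones. A minimal NumPy sketch of the same operation, using a small made-up array `toy` (its values are illustrative and not taken from the data set):

```python
import numpy as np

# Toy matrix (hypothetical values): 4 samples, 2 features on different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Column-wise standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0), the default behaviour of scale()
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # close to 0 for each column
print(toy_scaled.std(axis=0))   # close to 1 for each column
```

After this step both columns contribute on an equal footing, regardless of their original units.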

### 2.3. Perform PCA

Here we set the number of principal components (PCs) to 3.

In [10]:
pca = decomposition.PCA(n_components=3)
pca.fit(X)


PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

### 2.4. Compute and retrieve the **scores**

In [11]:
scores = pca.transform(X)
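
The scores are the coordinates of each sample along the principal axes. As a rough illustration of what `transform` computes, the sketch below projects centered data onto the first three right singular vectors using plain NumPy. The array `toy` is randomly generated for illustration only, and the signs of individual components may differ from scikit-learn's output because of its internal sign convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 50 samples, 4 correlated features
toy = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
toy_centered = toy - toy.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(toy_centered, full_matrices=False)
components = Vt[:3]                        # first 3 axes, shape (3, 4)

# Scores: project every sample onto the 3 principal axes
toy_scores = toy_centered @ components.T   # shape (50, 3)

# The same scores can be read directly from the SVD factors
same_scores = U[:, :3] * S[:3]
print(np.allclose(toy_scores, same_scores))
```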

In [12]:
scores_df = pd.DataFrame(scores, columns=['PC1', 'PC2', 'PC3'])
scores_df

| | PC1 | PC2 | PC3 |
|---|---|---|---|
| 0 | 9.192837 | 1.948583 | -1.123166 |
| 1 | 2.387802 | -3.768172 | -0.529293 |
| 2 | 5.733896 | -1.075174 | -0.551748 |
| 3 | 7.122953 | 10.275589 | -3.232790 |
| 4 | 3.935302 | -1.948072 | 1.389767 |
| ... | ... | ... | ... |
| 564 | 6.439315 | -3.576817 | 2.459487 |
| 565 | 3.793382 | -3.584048 | 2.088476 |
| 566 | 1.256179 | -1.902297 | 0.562731 |
| 567 | 10.374794 | 1.672010 | -1.877029 |


In [16]:
# Map the integer class codes (0 = malignant, 1 = benign) to their names
Y_label = bc.target_names[Y]

diagnosis = pd.DataFrame(Y_label, columns=['diagnosis'])

In [17]:
df_scores = pd.concat([scores_df, diagnosis], axis=1)

### 2.5. Retrieve the **loadings**

In [18]:
loadings = pca.components_.T
df_loadings = pd.DataFrame(loadings, columns=['PC1', 'PC2','PC3'], index=bc.feature_names)
df_loadings

| | PC1 | PC2 | PC3 |
|---|---|---|---|
| mean radius | 0.218902 | -0.233857 | -0.008531 |
| mean texture | 0.103725 | -0.059706 | 0.06455 |
| mean perimeter | 0.227537 | -0.215181 | -0.009314 |
| mean area | 0.220995 | -0.231077 | 0.0287 |
| mean smoothness | 0.14259 | 0.186113 | -0.104292 |
| mean compactness | 0.239285 | 0.151892 | -0.074092 |
| mean concavity | 0.2584 | 0.060165 | 0.002734 |
| mean concave points | 0.260854 | -0.034767 | -0.025564 |
| mean symmetry | 0.138167 | 0.190349 | -0.04024 |
| mean fractal dimension | 0.064363 | 0.366575 | -0.022574 |
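
One way to read the loadings is to rank features by absolute value: the sign gives the direction along the PC and the magnitude gives the weight of the feature in that PC. The sketch below does this for the ten "mean" features listed in the output above. The DataFrame `subset` is built by hand from those printed values, so the ranking covers only that subset and not all 30 features.

```python
import pandas as pd

# PC1 and PC2 loadings for the ten "mean" features shown in the output above
subset = pd.DataFrame(
    {"PC1": [0.218902, 0.103725, 0.227537, 0.220995, 0.142590,
             0.239285, 0.258400, 0.260854, 0.138167, 0.064363],
     "PC2": [-0.233857, -0.059706, -0.215181, -0.231077, 0.186113,
             0.151892, 0.060165, -0.034767, 0.190349, 0.366575]},
    index=["mean radius", "mean texture", "mean perimeter", "mean area",
           "mean smoothness", "mean compactness", "mean concavity",
           "mean concave points", "mean symmetry", "mean fractal dimension"],
)

# Rank features by the magnitude of their loading on each PC
top_pc1 = subset["PC1"].abs().sort_values(ascending=False)
top_pc2 = subset["PC2"].abs().sort_values(ascending=False)

print(top_pc1.head(3))
print(top_pc2.head(3))
```

Within this subset, PC1 weights the concavity and compactness measures most heavily, while PC2 is led by mean fractal dimension.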


### 2.6. **Explained variance** for each PC

In [19]:
explained_variance = pca.explained_variance_ratio_
explained_variance

array([0.44272026, 0.18971182, 0.09393163])
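
These three ratios sum to about 0.726, so the three retained PCs carry roughly 73% of the total variance. As a rough sketch of where such ratios come from, the variance along each PC can be derived from the singular values of the centered data and divided by the total. The array `toy` below is randomly generated for illustration and is not the breast cancer data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 50 samples, 4 correlated features
toy = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
toy_centered = toy - toy.mean(axis=0)

# Singular values of the centered data, largest first
S = np.linalg.svd(toy_centered, compute_uv=False)

# Variance along each PC (sample variance, n - 1 in the denominator)
pc_variance = S**2 / (toy.shape[0] - 1)

# Fraction of the total variance carried by each PC
ratio = pc_variance / pc_variance.sum()
print(ratio)
```

When all components are kept, as in this toy case, the ratios sum to 1. Keeping only the first few gives the fraction of variance retained.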

## **3. Scree Plot**

### 3.1. Import library

In [20]:
import numpy as np
import plotly.express as px

### 3.2. Preparing explained variance and cumulative variance

#### 3.2.1. Preparing the explained variance data

In [21]:
explained_variance

array([0.44272026, 0.18971182, 0.09393163])

In [22]:
explained_variance = np.insert(explained_variance, 0, 0)

#### 3.2.2. Preparing the cumulative variance data

In [23]:
cumulative_variance = np.cumsum(np.round(explained_variance, decimals=3))
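
Note that the cell above rounds each ratio before summing, so rounding error accumulates in the running total. The sketch below compares that with summing first and rounding afterwards, using the ratios reported earlier. The variable names are chosen for illustration only.

```python
import numpy as np

# Ratios reported by the notebook, with the leading 0 already inserted
ratios = np.array([0.0, 0.44272026, 0.18971182, 0.09393163])

# Round each ratio, then accumulate (as in the cell above)
round_then_sum = np.cumsum(np.round(ratios, decimals=3))

# Accumulate the exact ratios, then round only the totals
sum_then_round = np.round(np.cumsum(ratios), decimals=3)

print(round_then_sum)
print(sum_then_round)
```

The first form gives 0.633 and 0.727 for PC2 and PC3, while the second gives 0.632 and 0.726. The difference is small, but the second form stays closer to the unrounded totals.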

#### 3.2.3. Combining the dataframe

In [24]:
pc_df = pd.DataFrame(['','PC1', 'PC2', 'PC3'], columns=['PC'])
explained_variance_df = pd.DataFrame(explained_variance, columns=['Explained Variance'])
cumulative_variance_df = pd.DataFrame(cumulative_variance, columns=['Cumulative Variance'])

In [25]:
df_explained_variance = pd.concat([pc_df, explained_variance_df, cumulative_variance_df], axis=1)
df_explained_variance

| | PC | Explained Variance | Cumulative Variance |
|---|---|---|---|
| 0 | | 0.0 | 0.0 |
| 1 | PC1 | 0.44272 | 0.443 |
| 2 | PC2 | 0.189712 | 0.633 |
| 3 | PC3 | 0.093932 | 0.727 |


#### 3.2.4. Making the scree plot

##### 3.2.4.1. Explained Variance

In [26]:
# https://plotly.com/python/bar-charts/

fig = px.bar(df_explained_variance, 
             x='PC', y='Explained Variance',
             text='Explained Variance',
             width=800)

fig.update_traces(texttemplate='%{text:.3f}', textposition='outside')
fig.show()

##### 3.2.4.2. Explained Variance + Cumulative Variance

In [28]:
# https://plotly.com/python/creating-and-updating-figures/

import plotly.graph_objects as go

fig = go.Figure()

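# Line with markers for the cumulative variance
fig.add_trace(
    go.Scatter(
        x=df_explained_variance['PC'],
        y=df_explained_variance['Cumulative Variance'],
        name='Cumulative Variance',
        marker=dict(size=15, color="LightSeaGreen")
    ))

# Bars for the variance explained by each PC
fig.add_trace(
    go.Bar(
        x=df_explained_variance['PC'],
        y=df_explained_variance['Explained Variance'],
        name='Explained Variance',
        marker=dict(color="RoyalBlue")
    ))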

fig.show()

##### 3.2.4.3. Explained Variance + Cumulative Variance (Separate Plot)

In [29]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=1, cols=2)

# Left panel: cumulative variance
fig.add_trace(
    go.Scatter(
        x=df_explained_variance['PC'],
        y=df_explained_variance['Cumulative Variance'],
        name='Cumulative Variance',
        marker=dict(size=15, color="LightSeaGreen")
    ), row=1, col=1
    )

# Right panel: variance explained by each PC
fig.add_trace(
    go.Bar(
        x=df_explained_variance['PC'],
        y=df_explained_variance['Explained Variance'],
        name='Explained Variance',
        marker=dict(color="RoyalBlue"),
    ), row=1, col=2
    )

fig.show()

## **4. Scores Plot**

Source: https://plotly.com/python/3d-scatter-plots/

### 4.1. Load library
[API Documentation](https://plotly.com/python-api-reference/plotly.express.html) for the *plotly.express* package

In [30]:
import plotly.express as px

### 4.2. Basic 3D Scatter Plot

In [32]:
fig = px.scatter_3d(df_scores, x='PC1', y='PC2', z='PC3',
              color='diagnosis')

fig.show()

### 4.3. Customized 3D Scatter Plot

In [33]:
fig = px.scatter_3d(df_scores, x='PC1', y='PC2', z='PC3',
              color='diagnosis',
              symbol='diagnosis',
              opacity=0.5)

# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

# Optional template, see https://plotly.com/python/templates/
# Options: "plotly", "plotly_white", "plotly_dark", "ggplot2", "seaborn", "simple_white", "none"
#fig.update_layout(template='plotly_white')

fig.show()

## **5. Loadings Plot**

In [34]:
# Use the feature names as text labels for the points
loadings_label = df_loadings.index

fig = px.scatter_3d(df_loadings, x='PC1', y='PC2', z='PC3',
                    text = loadings_label)

fig.show()

---