<a href="https://colab.research.google.com/github/2333amy/hds-blog/blob/main/Week-07-Exercise-PCA-copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](https://drive.google.com/uc?export=view&id=1DXUVHxd4t15mfuqMgMCLnsP4jWVI5EWz)

---

<br>
© 2022 Copyright The University of New South Wales - CRICOS 00098G

**Author**: Sebastiano Barbieri

**Contributors/Co-authors**: Oscar Perez-Concha, Marta Fredes-Torres and Zhisheng (Sandy) Sa.

# Week 7 - Exercise 1: PCA

# 1. Introduction

In order to understand this exercise, read the README.md file and the reading material for this week first. 

## 1.1. Aims:
1. Perform PCA using Scikit-Learn and singular value decomposition
2. Select an appropriate number of principal components by computing the proportion of variance explained
3. Understand what biplots represent
 
## 1.2. Jupyter Notebook Instructions
1. Read the content of each cell.
2. Where necessary, follow the instructions that are written in each cell.
3. Run/Execute all the cells that contain Python code sequentially (one at a time), using the "Run" button.
4. For those cells in which you are asked to write some code, please write the Python code first and then execute/run the cell.

In [1]:
# Check that the required libraries are installed; install any that are missing with pip
import sys
import subprocess
import pkg_resources

required = {'numpy', 'pandas', 'plotnine', 'matplotlib', 'seaborn', 
            'grid', 'lime', 'shap', 'scikit-learn', 'graphviz'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    print('Installing: ', missing)
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
# delete unwanted variables
del required 
del installed 
del missing

Installing:  {'lime', 'grid', 'shap'}


In [None]:
# Mount Google Drive
# You do not need to run this cell if you are not running this notebook in Google Colab

if 'google.colab' in str(get_ipython()):
    from google.colab import drive # import drive from Google Colab
    root = '/content/drive'     # default location for the drive
    # print(root)                 # print content of ROOT (Optional)
    drive.mount(root)
else:
    print('Not running on CoLab')

In [None]:
from pathlib import Path

if 'google.colab' in str(get_ipython()):
    # EDIT THE PROJECT PATH IF YOURS IS DIFFERENT
    project_path = Path(root) / 'MyDrive' / 'HDAT9500' / 'week07'

    # OPTIONAL - set working directory according to your google drive project path
    # import os
    # Change directory to the location defined in project_path
    # os.chdir(project_path)
else:
    project_path = Path()

# 2. Docstring: 

Create a docstring with the variables and constants that you will use in this exercise (data dictionary) and the purpose of your program. It is expected that you choose informative variable names and document your program (both docstrings and comments).

<b> Write the answer here:</b>

#####################################################################################################################

(double-click here)


#####################################################################################################################

# 3. Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set

## 3.1. Data Set Information:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. 
This database is also available through the UW CS FTP server (run `ftp ftp.cs.wisc.edu`, then `cd math-prog/cpo-dataset/machine-learn/WDBC/`) and from the UCI Machine Learning Repository, which hosts [the data dictionary and more information](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).

## 3.2. Attribute Information:

1. ID number
2. Diagnosis (M = malignant, B = benign)
3. Fields 3 to 32: ten real-valued features are computed for each cell nucleus:


    a. radius (mean of distances from center to points on the perimeter)

    b. texture (standard deviation of gray-scale values)

    c. perimeter

    d. area

    e. smoothness (local variation in radius lengths)

    f. compactness (perimeter^2/area - 1.0)

    g. concavity (severity of concave portions of the contour)

    h. concave points (number of concave portions of the contour)

    i. symmetry
    
    j. fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
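
The field layout described above can be sketched in a few lines of Python. This is an illustration only: the `features` and `statistics` lists and the `field_number` helper are hypothetical names introduced here, and they assume the features appear in the order listed in section 3.2.

```python
# The ten base features, in the order listed above
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal dimension"]
# The three statistics computed for each feature, in field order
statistics = ["mean", "SE", "worst"]


def field_number(feature, statistic):
    """Return the 1-based field number (fields 1 and 2 are ID and diagnosis)."""
    return 3 + 10 * statistics.index(statistic) + features.index(feature)


print(field_number("radius", "mean"))    # 3
print(field_number("radius", "SE"))      # 13
print(field_number("radius", "worst"))   # 23
print(len(features) * len(statistics))   # 30
```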

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

[Further information](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))

## 3.3. Import dataset

In [None]:
import numpy as np
import pandas as pd

Import the dataset and print it. Have a look at the dataset.

In [None]:
data_path = Path(project_path) / 'data' / 'breast-cancer-wisconsin-data' / 'data.csv'
bcw = pd.read_csv(data_path, sep=',')
print(bcw)

In [None]:
bcw.describe(include='all')

# 4. Principal Component Analysis

## 4.1. PCA Using Scikit-Learn

In [None]:
from sklearn.decomposition import PCA

In [None]:
from sklearn import preprocessing

Select the first <font color=green><b>10</b></font> numerical features in the dataset; these will be our feature vectors.

In [None]:
print(bcw.columns)
X = bcw[bcw.columns[2:12]]
X.describe()

Scale data to have zero mean and unit variance.

In [None]:
X_scaled = preprocessing.scale(X)
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
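
To see what the scaling step does, the sketch below standardises a small matrix by hand with NumPy. The matrix `X_demo` is randomly generated stand-in data (an assumption for illustration), not the breast cancer dataset.

```python
import numpy as np

# Hypothetical stand-in for the feature matrix X (5 samples, 3 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 200.0, 0.1], scale=[2.0, 50.0, 0.02], size=(5, 3))

# Standardise each column: subtract its mean, then divide by its standard deviation.
# np.std uses ddof=0 (population standard deviation) by default, which is also
# the convention used by sklearn.preprocessing.scale.
X_demo_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled.mean(axis=0))  # approximately 0 for every column
print(X_demo_scaled.std(axis=0))   # approximately 1 for every column
```

Scaling matters for PCA because features measured on large scales (such as area) would otherwise dominate the principal components.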

Run PCA, keeping the first two principal components.

[Read more on PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

In [None]:
pca = PCA(n_components = 2)
X2D = pca.fit_transform(X_scaled)
print(X2D)

Look at the loadings of the first two principal components.

In [None]:
print(pca.components_)
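
The first aim of this exercise also mentions singular value decomposition (SVD). The sketch below shows one way to obtain principal components and scores directly from NumPy's SVD. It uses randomly generated stand-in data (`X_demo`), and the names `components` and `scores` are chosen here for illustration. The signs of individual components may differ from Scikit-Learn's output, because the sign of a principal direction is arbitrary.

```python
import numpy as np

# Hypothetical stand-in for X_scaled: 100 samples, 4 correlated features
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
mixed = base @ rng.normal(size=(2, 2)) + 0.1 * rng.normal(size=(100, 2))
X_demo = np.hstack([base, mixed])
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Thin SVD of the centred data: X_demo = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_demo, full_matrices=False)

# The rows of Vt are the principal directions (analogous to pca.components_)
components = Vt[:2]

# Project the data onto the first two principal components (analogous to X2D)
scores = X_demo @ components.T

# The same scores can be read off the SVD factors directly
print(np.allclose(scores, U[:, :2] * S[:2]))
```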

## 4.2. Biplot 

For more information on biplots and the `biplot` function defined below:

1. [Biplot function](https://sukhbinder.wordpress.com/2015/08/05/biplot-with-python/) and
[PCA](https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e)

2. [What is a biplot?](https://en.wikipedia.org/wiki/Biplot)

In [None]:
import matplotlib.pyplot as plt

# Class labels
y = bcw.loc[:, 'diagnosis']

# Define biplot: scatter of the PC scores with one arrow per feature loading
def biplot(score, coeff, classes, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # Rescale the scores to a unit range so they fit on the same axes as the loadings
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())

    fig, ax = plt.subplots(figsize=(8, 8))
    # One scatter series per class (M = malignant, B = benign)
    for g in np.unique(classes):
        ix = np.where(classes == g)
        ax.scatter(xs[ix] * scalex, ys[ix] * scaley, label=g)
    # One arrow and one text label per original feature
    for i in range(n):
        ax.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        ax.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend(loc='best')
    ax.grid()

# Call the function with the scores and loadings of the first 2 principal components.
biplot(X2D, pca.components_.T, y, X.columns)
plt.show()

All the loadings of the first principal component are positive while for the second principal component some loadings are positive and some are negative.

## 4.3. Proportion of Variance Explained

Determine the proportion of variance explained by each of the first two principal components.

In [None]:
pve = pca.explained_variance_ratio_
print(pve)

The first principal component explains 54.8% of the variance, while the second explains 25.2%.
How much variance do the two principal components explain together?

In [None]:
print(sum(pve))
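
The proportion of variance explained can also be derived by hand from the singular values of the centred data. The following is a minimal sketch on randomly generated stand-in data; `X_demo`, `explained_variance` and `pve_demo` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical stand-in data: 200 samples, 5 features with different spreads
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X_demo = X_demo - X_demo.mean(axis=0)  # PCA works on centred data

# Singular values of the centred data, in decreasing order
_, S, _ = np.linalg.svd(X_demo, full_matrices=False)

# Variance along each principal component (n - 1 in the denominator)
explained_variance = S**2 / (X_demo.shape[0] - 1)

# Proportion of variance explained by each component
pve_demo = explained_variance / explained_variance.sum()

print(pve_demo)
print(pve_demo.sum())  # the proportions add up to 1 when all components are kept
```

With all components kept, the variances along the principal components add up to the total variance of the original features, which is why the proportions sum to 1.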

How many principal components do we need to explain 95% of the data variance?

In [None]:
# Compute PCA without reducing dimensionality
pca = PCA()
pca.fit(X_scaled)

# Plot the cumulative sum of explained variance
cumulative_sum = np.cumsum(pca.explained_variance_ratio_)
plt.plot(cumulative_sum)
plt.xlabel('Dimension')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()

# Compute the minimum number of dimensions d that explain 95% of the data variance
d = np.argmax(cumulative_sum >= 0.95)+1
print("Minimum number of dimensions that explain 95% of the data variance:")
print(d)
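
The expression `np.argmax(cumulative_sum >= 0.95) + 1` can look cryptic at first. The sketch below walks through it step by step on made-up numbers; `cumulative_demo` is a hypothetical array, not the output of the cell above.

```python
import numpy as np

# Hypothetical cumulative explained variance for five components
cumulative_demo = np.array([0.55, 0.80, 0.90, 0.96, 1.00])

# Boolean mask: which entries have reached the 95% threshold?
reached = cumulative_demo >= 0.95   # [False, False, False, True, True]

# np.argmax returns the index of the first maximum, i.e. the first True
first_index = np.argmax(reached)    # 3

# Convert the 0-based index into a number of components
d_demo = first_index + 1

print(d_demo)  # 4
```

Note that `np.argmax` returns 0 when no entry is `True`. This is not an issue here, because the cumulative proportion over all components ends at 1.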

We could now re-run PCA with `n_components=d`. However, you can also set `n_components` to a float between 0.0 and 1.0, indicating the proportion of variance you wish to preserve.

In [None]:
pca = PCA(n_components=0.95)
XDD = pca.fit_transform(X_scaled)
print(XDD)
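
As a quick check of this behaviour, the sketch below fits `PCA(n_components=0.95)` on randomly generated stand-in data and inspects how many components were kept. The data and the names `pca_demo`, `X_reduced` and `X_recovered` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 300 samples, 6 features with different spreads
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

# Keep as many components as needed to preserve at least 95% of the variance
pca_demo = PCA(n_components=0.95)
X_reduced = pca_demo.fit_transform(X_demo)

print(pca_demo.n_components_)                    # number of components kept
print(pca_demo.explained_variance_ratio_.sum())  # at least 0.95

# Map the reduced data back to the original feature space (an approximation)
X_recovered = pca_demo.inverse_transform(X_reduced)
print(X_recovered.shape)
```

The fitted attribute `n_components_` reports the number of components that were actually selected, which should match the value of `d` computed by hand for the same data.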

Congratulations, you now know how to reduce the dimensionality of any dataset down to any number of dimensions, while preserving as much variance as possible!
