The Breast Cancer Wisconsin (Diagnostic) dataset is a well-known and straightforward binary classification dataset in the field of machine learning. It is frequently utilized to showcase the performance of various algorithms. This dataset is included within the sklearn package and comprises a total of 569 samples, with measurements of 30 distinct features pertaining to breast cancer cell nuclei. The primary objective is to predict whether a tumor is malignant or benign. A comprehensive guide to this dataset can be accessed at the following link: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset

For this assignment, you are required to apply clustering algorithms and Principal Component Analysis (PCA) to the dataset and answer the questions that follow. The following code snippet loads the dataset, divides it into training and testing sets, and normalizes the features for subsequent analysis. Please do not change the following code. **Please note that for this assignment, you are not required to create a validation set.**

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load the breast cancer dataset
data = load_breast_cancer()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.5, random_state=20)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
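
The scaler above is fit on the training split only and then reused on the test split, so no test-set statistics enter the preprocessing. A minimal sketch of what that implies, using a small synthetic array rather than the assignment data (the names `demo_train`, `demo_test`, and `demo_scaler` are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (not the assignment data).
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
demo_test = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean/std from the training split
demo_test_scaled = demo_scaler.transform(demo_test)        # reuses the training mean/std

# The training split is standardized exactly; the test split only approximately.
print("train mean:", demo_train_scaled.mean(axis=0).round(3))
print("train std: ", demo_train_scaled.std(axis=0).round(3))
print("test mean: ", demo_test_scaled.mean(axis=0).round(3))
print("test std:  ", demo_test_scaled.std(axis=0).round(3))
```

The test-set means and standard deviations are expected to be close to, but not exactly, 0 and 1, because they are computed with the training statistics.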

In [None]:
# Displaying a few rows of the training dataset
print("Training Data:")
print(pd.DataFrame(X_train, columns=data.feature_names).head())
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)


# Displaying a few rows of the testing dataset
print("\nTesting Data:")
print(pd.DataFrame(X_test, columns=data.feature_names).head())
print("Shape of X_test:", X_test.shape)
print("Shape of y_test:", y_test.shape)


Training Data:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0    -1.309889      0.377965       -1.292508  -1.069139         0.615460   
1    -0.786488      0.435374       -0.794114  -0.720053        -0.639346   
2    -0.406847     -1.651995       -0.455673  -0.449519        -0.619938   
3    -0.730658     -1.130727       -0.706683  -0.702475         0.275072   
4     1.379695      1.590430        1.487140   1.300031         1.891913   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0         -0.483415       -0.808067            -0.850139      -0.298225   
1         -0.710875       -0.457596            -0.750320      -1.147102   
2         -0.859161       -0.785815            -0.623208      -0.792788   
3          0.167257       -0.267742            -0.574688       0.074543   
4          1.441261        1.227533             2.482059      -0.604559   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter 

In [None]:
# Observing testing dataset
print("Example of Testing dataset",X_test[284])
print("Label of example",y_test[284])

# Observing training dataset
print("Example of Training dataset",X_train[283])
print("Label of example",y_train[283])


Example of Testing dataset [ 1.67559157  0.0679602   1.56772148  1.69004032 -1.19098342 -0.33323099
  0.30193729  0.7222055   0.48052816 -1.78174259  0.62707174 -0.94751083
  0.4335168   0.72474738 -0.84520638 -0.62940947 -0.19114284  0.0842971
  0.13027108 -0.79881947  1.4055736  -0.43739676  1.26662608  1.37745857
 -1.21113087 -0.64956075 -0.06735751  0.45190424  0.31102964 -1.41502759]
Label of example 0
Example of Training dataset [-0.445928   -0.06752358 -0.41377029 -0.4747868  -0.63860001  0.01156648
  0.15398403 -0.12057312 -0.9773269  -0.10980428 -0.15175017  0.3814993
  0.13920215 -0.27956743  0.89013833  0.87931274  0.80675296  1.16209828
  0.62539525  0.15635109 -0.59125278 -0.53415817 -0.5363061  -0.58014688
 -1.02181529 -0.33867557 -0.16901379 -0.32669872 -1.26320481 -0.67568267]
Label of example 1


# Clustering Algorithms

1.1 In this question, you will apply the k-means algorithm to the **test set** directly. Report the **accuracy** of the algorithm by comparing the results to the known labels (use the accuracy_score function in sklearn). **Note**: You are allowed to use the sklearn package for algorithm implementation in this question. As the k-means algorithm is an unsupervised learning method, it does not utilize labels from the data. Consequently, the categories might be flipped (0 becomes 1, and 1 becomes 0). To address this issue, you can calculate both accuracy_score(y_test, kmeans.labels_) and accuracy_score(y_test, 1 - kmeans.labels_), then report the higher value.

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.

from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np

# Apply k-means algorithm to the test set
kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X_test)

# Predicted labels (original and flipped)
y_pred = kmeans.labels_
y_pred_flipped = 1 - kmeans.labels_

# Calculate accuracy scores for both label assignments
accuracy_original = accuracy_score(y_test, y_pred)
accuracy_flipped = accuracy_score(y_test, y_pred_flipped)

# Report the higher accuracy score
higher_accuracy = max(accuracy_original, accuracy_flipped)

print("Accuracy (Original Labels): {:.4f}".format(accuracy_original))
print("Accuracy (Flipped Labels): {:.4f}".format(accuracy_flipped))
print("Higher Accuracy: {:.4f}".format(higher_accuracy))




Accuracy (Original Labels): 0.9123
Accuracy (Flipped Labels): 0.0877
Higher Accuracy: 0.9123
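
The flip trick works because there are only two clusters, so there are only two possible mappings from cluster ids to class labels. A sketch of the same idea for a small number of clusters, trying every mapping and keeping the best one (the helper `best_matched_accuracy` is an illustrative assumption, not an sklearn function):

```python
from itertools import permutations

import numpy as np


def best_matched_accuracy(y_true, cluster_ids, n_clusters):
    """Return the highest accuracy over all mappings of cluster ids to class labels.

    Illustrative helper (not part of sklearn). It brute-forces all permutations,
    so it is only practical for a small number of clusters.
    """
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    best = 0.0
    for mapping in permutations(range(n_clusters)):
        # mapping[c] is the class label assigned to cluster c
        relabeled = np.asarray(mapping)[cluster_ids]
        best = max(best, float(np.mean(relabeled == y_true)))
    return best


# Toy example: the clusters are numbered differently from the classes.
toy_true = np.array([0, 0, 1, 1, 2, 2])
toy_clusters = np.array([2, 2, 0, 0, 1, 0])
toy_accuracy = best_matched_accuracy(toy_true, toy_clusters, n_clusters=3)
print("Best matched accuracy: {:.4f}".format(toy_accuracy))
```

With `n_clusters=2` this reduces to taking the maximum of the original and flipped accuracies, as done above.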




1.2 In this question, you will apply the Gaussian mixture model (GMM) to the **test set** directly. Report the **accuracy** of the algorithm by comparing the results to the known labels (use the accuracy_score function in sklearn). **Note**: You are allowed to use the sklearn package for algorithm implementation in this question. As the GMM is an unsupervised learning method, it does not utilize labels from the data. Consequently, the categories might be flipped (0 becomes 1, and 1 becomes 0). To address this issue, you can calculate both accuracy_score(y_test, y_pred) and accuracy_score(y_test, 1 - y_pred), where y_pred holds the component labels predicted by the GMM, then report the higher value.

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.

from sklearn.mixture import GaussianMixture
from sklearn.metrics import accuracy_score

# Apply GMM to the test set
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X_test)

# Predicted probabilities for each cluster
probabilities = gmm.predict_proba(X_test)

#Assign labels based on the cluster with the highest probability
y_pred = np.argmax(probabilities, axis=1)
y_pred_flipped = 1 - y_pred  # Flip labels

# Calculate accuracy scores for both label assignments
accuracy_original = accuracy_score(y_test, y_pred)
accuracy_flipped = accuracy_score(y_test, y_pred_flipped)

# Report the higher accuracy score
higher_accuracy = max(accuracy_original, accuracy_flipped)

print("Accuracy (Original Labels): {:.4f}".format(accuracy_original))
print("Accuracy (Flipped Labels): {:.4f}".format(accuracy_flipped))
print("Higher Accuracy: {:.4f}".format(higher_accuracy))


Accuracy (Original Labels): 0.9333
Accuracy (Flipped Labels): 0.0667
Higher Accuracy: 0.9333
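
Taking the argmax of `predict_proba` should give the same hard labels as calling `predict` directly, which is shorter. A small check on synthetic blobs (the data and the names `blob_X` and `demo_gmm` are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic two-cluster data, used only for illustration.
blob_X, _ = make_blobs(n_samples=300, centers=2, n_features=4, random_state=0)

demo_gmm = GaussianMixture(n_components=2, random_state=0).fit(blob_X)

hard_labels = demo_gmm.predict(blob_X)             # hard assignment per sample
soft_probs = demo_gmm.predict_proba(blob_X)        # per-component membership probabilities
labels_from_probs = np.argmax(soft_probs, axis=1)  # most probable component per sample

print("Same labels:", np.array_equal(hard_labels, labels_from_probs))
print("Rows sum to 1:", np.allclose(soft_probs.sum(axis=1), 1.0))
```

The probabilities are still useful on their own, for example to flag samples whose membership is uncertain.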


1.3 Compare and comment on the results

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.

K-means is slightly less accurate than GMM on this dataset: GMM reached an accuracy of 0.9333 on the test set, while k-means reached 0.9123. With 285 test samples, that gap corresponds to about six samples, so both models cluster this dataset well and the difference should not be over-interpreted.

GMM is the more flexible model: with sklearn's default settings it fits a full covariance matrix per component, so it can capture elongated or correlated clusters, and it provides probabilistic (soft) assignments. K-means assigns each point to the nearest centroid, which implicitly assumes roughly spherical clusters of similar size.
Both methods are iterative and depend on their initialization; sklearn's GaussianMixture initializes from k-means by default.
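
How much either method depends on its initialization is an empirical question. One way to probe it is to refit both models over several seeds and look at the spread of the matched accuracies. The sketch below recreates the split used above so that it runs on its own; the `seeds` range and the helper `matched_accuracy` are illustrative assumptions, and `n_init=1` is set so that each run uses a single initialization.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Recreate the split and scaling used above so the block is self-contained.
bc = load_breast_cancer()
Xtr, Xte, ytr, yte = train_test_split(bc.data, bc.target, test_size=0.5, random_state=20)
Xte_scaled = StandardScaler().fit(Xtr).transform(Xte)


def matched_accuracy(y_true, labels):
    """Accuracy after resolving the 0/1 label flip (illustrative helper)."""
    return max(accuracy_score(y_true, labels), accuracy_score(y_true, 1 - labels))


seeds = range(10)  # arbitrary choice for illustration
kmeans_scores = [
    matched_accuracy(yte, KMeans(n_clusters=2, n_init=1, random_state=s).fit_predict(Xte_scaled))
    for s in seeds
]
gmm_scores = [
    matched_accuracy(yte, GaussianMixture(n_components=2, n_init=1, random_state=s).fit_predict(Xte_scaled))
    for s in seeds
]

print("k-means: mean={:.4f}, std={:.4f}".format(np.mean(kmeans_scores), np.std(kmeans_scores)))
print("GMM:     mean={:.4f}, std={:.4f}".format(np.mean(gmm_scores), np.std(gmm_scores)))
```

A larger standard deviation across seeds would indicate stronger dependence on the starting point.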

# Principal Component Analysis

2.1 Apply Principal Component Analysis (PCA) to the **training set** using **numpy package** only. Keep only one principal component. Report the percentage of variance explained by the first component.

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.
import numpy as np

# Standardized training data (already obtained from StandardScaler)
X_train_standardized = X_train

# Calculate the covariance matrix
cov_matrix = np.cov(X_train_standardized, rowvar=False)

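# Perform eigendecomposition (eigh suits the symmetric covariance matrix)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# eigh returns eigenvalues in ascending order, so sort them in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep only the first principal component (largest eigenvalue)
first_principal_component = eigenvectors[:, 0]

# Calculate the percentage of variance explained
explained_variance_ratio = eigenvalues[0] / np.sum(eigenvalues)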
percentage_variance_explained = explained_variance_ratio * 100

print("Percentage of Variance Explained by the First Principal Component: {:.2f}%".format(percentage_variance_explained))


Percentage of Variance Explained by the First Principal Component: 45.39%
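
As a cross-check on the eigendecomposition route, the same variance ratios can be obtained from the singular values of the centered data, since the covariance eigenvalues equal s²/(n-1). A numpy-only sketch on synthetic data (the array `demo_X` is an illustrative assumption):

```python
import numpy as np

# Synthetic correlated data, used only for illustration.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
demo_Xc = demo_X - demo_X.mean(axis=0)  # PCA operates on centered data

# Route 1: eigenvalues of the covariance matrix (eigh returns ascending order, so reverse).
eigvals = np.linalg.eigh(np.cov(demo_Xc, rowvar=False))[0][::-1]

# Route 2: singular values of the centered data matrix (returned in descending order).
singular_values = np.linalg.svd(demo_Xc, compute_uv=False)
eigvals_from_svd = singular_values**2 / (demo_Xc.shape[0] - 1)

ratio_eig = eigvals / eigvals.sum()
ratio_svd = eigvals_from_svd / eigvals_from_svd.sum()
print("Explained variance ratios agree:", np.allclose(ratio_eig, ratio_svd))
```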


2.2 Apply Principal Component Analysis (PCA) to the **training set** using **sklearn** package. Keep only one principal component. Report the percentage of variance explained by the first component.

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.
from sklearn.decomposition import PCA

# Create a PCA instance and fit it to the standardized training data
pca = PCA(n_components=1)
X_train_pca = pca.fit_transform(X_train)

# Percentage of variance explained by the first component
variance_explained = pca.explained_variance_ratio_[0] * 100

print("Percentage of Variance Explained by the First Principal Component: {:.2f}%".format(variance_explained))


Percentage of Variance Explained by the First Principal Component: 45.39%
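
The numpy and sklearn routes agree on the explained variance, but the component vectors themselves are only defined up to sign, so a numpy eigenvector and sklearn's `components_[0]` may point in opposite directions. A small check on synthetic data (`demo_data` and the related names are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data, used only for illustration.
rng = np.random.default_rng(1)
demo_data = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))
demo_centered = demo_data - demo_data.mean(axis=0)

# numpy route: leading eigenvector of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(demo_centered, rowvar=False))
numpy_pc1 = vecs[:, np.argmax(vals)]

# sklearn route: first row of components_
sk_pca = PCA(n_components=1).fit(demo_data)
sklearn_pc1 = sk_pca.components_[0]

# Compare up to sign: the absolute cosine similarity of the two unit vectors should be 1.
cosine = float(numpy_pc1 @ sklearn_pc1)
ratio_numpy = vals.max() / vals.sum()
print("abs(cosine) is 1:", np.isclose(abs(cosine), 1.0))
print("ratios agree:", np.isclose(ratio_numpy, sk_pca.explained_variance_ratio_[0]))
```

A sign flip changes the sign of the projected scores but not the variance they explain.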


2.3 Build a basic Support Vector Machine (SVM) model using both the original normalized features and features transformed by Principal Component Analysis (PCA), retaining 1, 5, 10, and 30 principal components. Compare the performance of these models on the test set. How does the number of principal components affect the performance of the model? Note: You are allowed to use the sklearn package for this question.

In [None]:
# TODO: Please provide your answers here. You can add more cells for code or texts if needed.
# Don't delete these two comments so that it would be easier for me to locate your answers.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Function to build and evaluate SVM models
def build_and_evaluate_model(X_train, X_test, y_train, y_test):
    model = SVC(kernel='linear', random_state=0)

    # Fit the model
    model.fit(X_train, y_train)

    # Predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate and return accuracy
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# Original normalized features
accuracy_original = build_and_evaluate_model(X_train, X_test, y_train, y_test)
print("Accuracy with Original Features: {:.4f}".format(accuracy_original))

# Apply PCA with 1, 5, 10, and 30 principal components
for n in (1, 5, 10, 30):
    pca_n = PCA(n_components=n)
    X_train_pca_n = pca_n.fit_transform(X_train)  # fit PCA on the training set only
    X_test_pca_n = pca_n.transform(X_test)        # project the test set with the same components
    accuracy_pca_n = build_and_evaluate_model(X_train_pca_n, X_test_pca_n, y_train, y_test)
    label = "Principal Component" if n == 1 else "Principal Components"
    print("Accuracy with {} {}: {:.4f}".format(n, label, accuracy_pca_n))


Accuracy with Original Features: 0.9719
Accuracy with 1 Principal Component: 0.9193
Accuracy with 5 Principal Components: 0.9614
Accuracy with 10 Principal Components: 0.9789
Accuracy with 30 Principal Components: 0.9719
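
The scale, project, and classify steps can also be chained in a `Pipeline`, which keeps the preprocessing and the classifier together and fits PCA on the training split only. The sketch below recreates the split used above so that it runs on its own; the helper name `svm_accuracy_with_pca` is an illustrative assumption, and it also reports the cumulative variance retained at each setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Recreate the split used above so the block is self-contained.
cancer = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(cancer.data, cancer.target, test_size=0.5, random_state=20)


def svm_accuracy_with_pca(n_components):
    """Fit scaler -> PCA -> linear SVM on the training split.

    Returns the test accuracy and the fraction of training variance retained.
    """
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
        ("svm", SVC(kernel="linear", random_state=0)),
    ])
    pipe.fit(X_tr, y_tr)
    retained = pipe.named_steps["pca"].explained_variance_ratio_.sum()
    return pipe.score(X_te, y_te), retained


pca_results = {n: svm_accuracy_with_pca(n) for n in (1, 5, 10, 30)}
for n, (acc, retained) in pca_results.items():
    print("{:>2} components: accuracy={:.4f}, variance retained={:.1%}".format(n, acc, retained))
```

Reading accuracy next to retained variance makes it easier to see where adding components stops helping.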
