# Homework Assignment 2 - Chem 277B
## Breast Cancer Prediction

**1) Objective**

Analyze the Breast Cancer Wisconsin dataset, classify the samples as malignant or benign using Gaussian Naive Bayes models, and evaluate performance using accuracy and a confusion matrix.

<br>

**2) Preparation**

Before starting, import the necessary libraries for data analysis and visualization. 

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

<br>

**3) Dataset**

Load the dataset from `sklearn`, which contains cell descriptors and a target variable (diagnosis: 0 = malignant, 1 = benign).

In [None]:
data, target = load_breast_cancer(return_X_y=True, as_frame=True)

In [None]:
data

In [None]:
target

Split the dataset into training (80%) and testing (20%) sets. Use `random_state=42` for reproducibility.
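
As a starting point, the split could look like the sketch below. It assumes scikit-learn's `train_test_split` with `test_size=0.2`; the variable names match the ones used by the `print` statements in the next cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data, target = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 20% of the rows for testing; random_state fixes the shuffle
# so the same split is produced on every run.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The test-set size is rounded up, so the 569 samples are divided into 455 training and 114 test rows.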

In [None]:
######## Fill in the code below ########

########################################

print(f'Number of training samples: {X_train.shape[0]}')
print(f'- Number of malignant samples: {y_train[y_train==0].shape[0]} ({y_train[y_train==0].shape[0]/y_train.shape[0]*100:.1f}%)')
print(f'- Number of benign samples: {y_train[y_train==1].shape[0]} ({y_train[y_train==1].shape[0]/y_train.shape[0]*100:.1f}%)')
print(f'Number of test samples: {X_test.shape[0]}')
print(f'- Number of malignant samples: {y_test[y_test==0].shape[0]} ({y_test[y_test==0].shape[0]/y_test.shape[0]*100:.1f}%)')
print(f'- Number of benign samples: {y_test[y_test==1].shape[0]} ({y_test[y_test==1].shape[0]/y_test.shape[0]*100:.1f}%)')

<br>

**4) Visualize**

Use the *mean radius* (the first feature) to create two histograms: one for malignant cells and one for benign cells. Overlay the two histograms to visualize the distribution of this feature for both classes. Discuss any similarities or differences.
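
One possible way to overlay the two histograms is sketched here. It assumes the full dataset is plotted (the task does not say whether to use all samples or only the training set) and uses shared bin edges and partial transparency so both classes stay visible; the bin count of 30 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

data, target = load_breast_cancer(return_X_y=True, as_frame=True)

feature = data['mean radius']

# Shared bin edges make the two histograms directly comparable.
bins = np.linspace(feature.min(), feature.max(), 31)  # 30 bins (assumed)

plt.figure()
plt.hist(feature[target == 0], bins=bins, alpha=0.5, label='Malignant')
plt.hist(feature[target == 1], bins=bins, alpha=0.5, label='Benign')
plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```

Because the classes are not the same size, passing `density=True` to `plt.hist` is an option if you want to compare the shapes of the distributions rather than raw counts.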

In [None]:
plt.figure()

######## Fill in the code below ########


########################################

plt.xlabel('Mean Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()

*your discussion*

Next, use the *mean fractal dimension* (the tenth feature, column index 9) to create the same two histograms. Compare the distributions with those of the *mean radius*. Which feature appears to separate the two classes better?

In [None]:
plt.figure()

######## Fill in the code below ########


########################################

plt.xlabel('Mean Fractal Dimension')
plt.ylabel('Frequency')
plt.legend()
plt.show()

*your discussion*

<br>

**5) Naive Bayes Classification**

Train a Gaussian Naive Bayes classifier using only *mean radius* and discuss its accuracy.
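
A minimal sketch of a single-feature model follows, assuming the same 80/20 split as in section 3. The name `model` is illustrative only. Note the double brackets: scikit-learn expects a two-dimensional input even when there is only one feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data, target = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

# Double brackets select a DataFrame of shape (n_samples, 1), not a 1-D Series.
model = GaussianNB()
model.fit(X_train[['mean radius']], y_train)
y_pred = model.predict(X_test[['mean radius']])

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using mean radius: {accuracy:.2f}')
```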

In [None]:
######## Fill in the code below ########


########################################

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using mean radius: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

*your discussion*

Train another Gaussian Naive Bayes classifier using only *mean fractal dimension* and discuss its accuracy.

In [None]:
######## Fill in the code below ########



########################################

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using mean fractal dimension: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

*your discussion*

Next, train a Gaussian Naive Bayes classifier using both *mean radius* and *mean fractal dimension* and evaluate its accuracy. Remember to scale the features this time, for example with the `MinMaxScaler` imported above. How does this model compare to the previous two?
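
The scaling step can be sketched as follows, assuming `MinMaxScaler` and the same split as before. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data enters the preprocessing. The names `features`, `X_train_scaled`, and `X_test_scaled` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

data, target = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

features = ['mean radius', 'mean fractal dimension']

# Learn the min and max from the training data, then reuse them for the test data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train[features])
X_test_scaled = scaler.transform(X_test[features])

model = GaussianNB()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using mean radius and mean fractal dimension: {accuracy:.2f}')
```

Scaled test values can fall slightly outside the range [0, 1] when a test sample lies beyond the training minimum or maximum; that is expected and not an error.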

In [None]:
######## Fill in the code below ########






########################################

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using mean radius and mean fractal dimension: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

*your discussion*

Lastly, train a Gaussian Naive Bayes classifier using all 30 features and evaluate its accuracy. Compare the performance of this model to the previous models.
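
When comparing models, it can help to read individual cells of the confusion matrix and not only the overall accuracy. The sketch below assumes min-max scaling of all 30 features and the same split as before. `confusion_matrix` places true labels on the rows and predicted labels on the columns, sorted by label value, so row 0 is malignant and row 1 is benign. The names `missed_malignant` and `false_alarms` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

data, target = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GaussianNB().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

cm = confusion_matrix(y_test, y_pred)

# cm[i, j] counts samples with true label i that were predicted as label j.
missed_malignant = cm[0, 1]  # malignant samples predicted as benign
false_alarms = cm[1, 0]      # benign samples predicted as malignant

# The diagonal holds the correct predictions, so this equals the accuracy.
accuracy = cm.trace() / cm.sum()
print(f'Accuracy: {accuracy:.2f}')
print(f'Malignant predicted as benign: {missed_malignant}')
print(f'Benign predicted as malignant: {false_alarms}')
```

In a diagnostic setting the two kinds of error are not equally costly, which is worth addressing in the discussion.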

In [None]:
######## Fill in the code below ########






########################################

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using all features: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

*your discussion*

<br>

**6) Feature Analysis**

One assumption of Naive Bayes is that the features are mutually independent. Generate a heatmap of the pairwise Pearson correlation coefficients. Which features are correlated? Do the correlations make sense?
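
A heatmap of a 30 x 30 matrix can be hard to read precisely, so a ranked list of the strongest pairs may be a useful complement. The sketch below assumes the correlations are computed on the full feature table with `DataFrame.corr()`, which uses Pearson's coefficient by default. The names `corr`, `mask`, and `pairs` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data, _ = load_breast_cancer(return_X_y=True, as_frame=True)

corr = data.corr()  # Pearson correlation by default

# Keep only the strict upper triangle so each pair appears once
# and the trivial self-correlations on the diagonal are excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs.head(10))
```

The same `corr` DataFrame can be passed to `sns.heatmap` to draw the figure requested above.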

In [None]:
######## Fill in the code below ########




########################################

*your discussion*

Remove three features that are highly correlated with another feature and rerun the analysis from 5). Compare the accuracy to the value you obtain when you include all features.
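
Dropping columns and refitting could be sketched as below. The three names in `dropped` are chosen only for illustration, on the assumption that perimeter and area largely repeat the information in the radius; base your own choice on your heatmap. The split and scaling follow the earlier steps.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

data, target = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

# Illustrative choice only (assumption): size descriptors that track the radius.
dropped = ['mean perimeter', 'mean area', 'worst perimeter']
X_train_reduced = X_train.drop(columns=dropped)
X_test_reduced = X_test.drop(columns=dropped)

# Fit the scaler on the reduced training set, then apply it to the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_reduced)
X_test_scaled = scaler.transform(X_test_reduced)

model = GaussianNB().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy using {X_train_reduced.shape[1]} features: {accuracy:.2f}')
```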

In [None]:
######## Fill in the code below ########






########################################

accuracy = np.mean(y_pred == y_test)
print(f'Accuracy after removing three correlated features: {accuracy:.2f}')

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

*your discussion*