---

## **Machine Learning - II**

---

### **Gaussian Naive Bayes on Breast Cancer dataset**

In [None]:
# Importing necessary libraries
from sklearn.datasets import load_breast_cancer  # Importing the Breast Cancer dataset from sklearn

In [None]:
# Loading the Breast Cancer dataset
ds = load_breast_cancer()

In [None]:
# Assigning feature variables to X
X = ds.data  # Feature data of the Breast Cancer dataset (mean radius, mean texture, etc.)
X  # Displaying the feature data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In [None]:
# Assigning target labels to y
y = ds.target  # Target labels (0: Malignant, 1: Benign)
y  # Displaying the target labels

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
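
To confirm the label encoding noted in the comment above and to see how balanced the two classes are, a quick check along these lines may help. It is a minimal sketch that reloads the dataset so it runs on its own; `counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

ds = load_breast_cancer()

# target_names is ordered by label value: index 0 -> label 0, index 1 -> label 1
print(ds.target_names)

# Number of samples per class label
counts = np.bincount(ds.target)
for name, n in zip(ds.target_names, counts):
    print(f"{name}: {n} samples ({n / counts.sum():.1%})")
```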

In [None]:
# Importing Gaussian Naive Bayes and train-test split function
from sklearn.naive_bayes import GaussianNB  # Gaussian Naive Bayes for continuous features
from sklearn.model_selection import train_test_split  # Splitting the data into training and testing sets

In [None]:
# Splitting the dataset into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=1)
# xtrain, ytrain: Training data and labels
# xtest, ytest: Testing data and labels
# test_size=0.25 means 25% of the data is used for testing
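
As a rough check of what this split produces, the sketch below prints the resulting shapes and the share of Benign labels in each part. The `stratify=y` variant is not used in this notebook; it is shown only as an option that keeps the class proportions nearly identical in both parts. The `_s`-suffixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same split as in the notebook
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=1)
print(xtrain.shape, xtest.shape)  # 569 samples -> 426 for training, 143 for testing
print(f"Benign share: train={ytrain.mean():.3f}, test={ytest.mean():.3f}")

# Optional variant (not in the original): stratify on y to preserve class proportions
xtrain_s, xtest_s, ytrain_s, ytest_s = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(f"Benign share (stratified): train={ytrain_s.mean():.3f}, test={ytest_s.mean():.3f}")
```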

In [None]:
# Initializing the Gaussian Naive Bayes model
model1 = GaussianNB()  # Gaussian Naive Bayes models each continuous feature as normally distributed within each class
# Strongly skewed features can be transformed first (e.g., with a log transform), which may improve the fit

# It can be used for both binary and multiclass classification problems

In [None]:
# Training the model with the training data
model1.fit(xtrain, ytrain)  # Fitting the model on training data (xtrain) and labels (ytrain)
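
To make concrete what `fit` and `predict` do here, the following sketch re-implements the Gaussian Naive Bayes decision rule with NumPy: estimate a prior, a mean, and a variance per class and feature, then pick the class with the highest log-posterior. It is an illustration, not scikit-learn's source code. All variable names are illustrative, and the smoothing term `eps` is an assumption meant to mirror the default `var_smoothing=1e-9`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=1)

classes = np.unique(ytrain)
# Small constant added to every variance for numerical stability
# (assumed to mirror GaussianNB's default var_smoothing=1e-9)
eps = 1e-9 * xtrain.var(axis=0).max()

# Per-class prior, per-class/per-feature mean and variance
priors = np.array([np.mean(ytrain == c) for c in classes])
means = np.array([xtrain[ytrain == c].mean(axis=0) for c in classes])
variances = np.array([xtrain[ytrain == c].var(axis=0) for c in classes]) + eps

def log_posterior(rows):
    """Unnormalized log P(class | x) for each row, one column per class."""
    scores = []
    for prior, mu, var in zip(priors, means, variances):
        # Sum of per-feature Gaussian log-densities (the "naive" independence assumption)
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (rows - mu) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.column_stack(scores)

manual_pred = classes[np.argmax(log_posterior(xtest), axis=1)]

# Compare against scikit-learn's implementation
sk_pred = GaussianNB().fit(xtrain, ytrain).predict(xtest)
print("Agreement with GaussianNB:", np.mean(manual_pred == sk_pred))
```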

In [None]:
# Making predictions on the test data
predictions = model1.predict(xtest)  # Predicting the labels for the test data (xtest)
predictions  # Displaying the predicted labels

array([0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1])

In [None]:
# Importing metrics for evaluation
from sklearn.metrics import accuracy_score, confusion_matrix  # Importing accuracy score and confusion matrix

In [None]:
# Calculating the accuracy of the model
accuracy_score(ytest, predictions)  # Calculating accuracy score (ratio of correct predictions)

0.9440559440559441

In [None]:
# Generating the confusion matrix
confusion_matrix(ytest, predictions)  # Rows are true labels, columns are predicted labels (0: Malignant, 1: Benign)

# Output:
# array([[50,  5],  # 50 Malignant cases correctly classified, 5 misclassified as Benign
#        [ 3, 85]]) # 85 Benign cases correctly classified, 3 misclassified as Malignant

array([[50,  5],
       [ 3, 85]])
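
Per-class rates can be derived directly from this matrix. The sketch below hard-codes the matrix shown above (rows are true labels, columns are predicted labels, with label 0 = Malignant) rather than refitting the model; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true labels, columns = predicted labels
cm = np.array([[50, 5],
               [3, 85]])

accuracy = np.trace(cm) / cm.sum()               # correct predictions / all predictions
malignant_recall = cm[0, 0] / cm[0].sum()        # share of Malignant cases that were detected
benign_recall = cm[1, 1] / cm[1].sum()           # share of Benign cases that were detected
malignant_precision = cm[0, 0] / cm[:, 0].sum()  # share of Malignant predictions that were correct

print(f"Accuracy:            {accuracy:.4f}")
print(f"Malignant recall:    {malignant_recall:.4f}")
print(f"Benign recall:       {benign_recall:.4f}")
print(f"Malignant precision: {malignant_precision:.4f}")
```

The Malignant recall is the figure most relevant to missed diagnoses, since every missed Malignant case is a false negative.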

---
### **Results and Interpretation:**
---
The confusion matrix and accuracy score provide the following insights:

- **Accuracy Score:** 0.9441
    - The model correctly classified 135 of the 143 test samples (about 94.41%).
  
- **Confusion Matrix:**
    - **Malignant (0):** 50 out of 55 Malignant cases (about 90.9%) were classified correctly, with 5 misclassified as Benign.
    - **Benign (1):** 85 out of 88 Benign cases (about 96.6%) were classified correctly, with 3 misclassified as Malignant.

---
### **Conclusion:**
---
- **Gaussian Naive Bayes** performs well on the Breast Cancer dataset, reaching **94.41%** accuracy on the held-out test set (25% of the data). Because this figure comes from a single train-test split, it is an estimate rather than a guarantee for new data.
- The confusion matrix shows that the model misclassifies Malignant cases as Benign (5 of 55, about 9.1%) more often than Benign cases as Malignant (3 of 88, about 3.4%), which is a critical consideration for medical diagnosis.
---
### **Interpretation:**
---
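- The Gaussian Naive Bayes model is a natural choice for this dataset because all features are continuous. It also assumes that each feature is approximately normally distributed within each class, an assumption this notebook does not check.
- If the features deviate significantly from normality, a transformation (e.g., logarithmic or Box-Cox) may improve the model's performance. Box-Cox requires strictly positive inputs, so features containing zeros need a shift or a different transform.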

- **Clinical Significance:** In the context of breast cancer diagnosis, a false negative (misclassifying a malignant tumor as benign) can have serious consequences, so model improvements may be needed to minimize such errors.
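
The transformation idea from the bullets above can be sketched with a pipeline. This example uses `np.log1p`, i.e. log(1 + x), rather than a plain logarithm or Box-Cox because the feature matrix contains zeros, as the printed array near the top shows. Whether the transform helps on this split is not something the notebook establishes, so the printed accuracies should be read as an experiment. `baseline` and `log_nb` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=1)

# Baseline: GaussianNB on the raw features, as in the notebook
baseline = GaussianNB().fit(xtrain, ytrain)

# Variant: apply log(1 + x) to every feature before fitting
# (valid here because all features in this dataset are non-negative)
log_nb = make_pipeline(FunctionTransformer(np.log1p), GaussianNB())
log_nb.fit(xtrain, ytrain)

print("Raw features:  ", baseline.score(xtest, ytest))
print("log1p features:", log_nb.score(xtest, ytest))
```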
---
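
One way to act on the false-negative concern is to lower the probability threshold at which a case is flagged as Malignant. The sketch below assumes the same split as the notebook. The threshold value 0.1 and the helper name `count_errors` are hypothetical choices for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=1)
model = GaussianNB().fit(xtrain, ytrain)

# Columns of predict_proba follow model.classes_, so look up the column for label 0 (Malignant)
malignant_col = list(model.classes_).index(0)
p_malignant = model.predict_proba(xtest)[:, malignant_col]

def count_errors(threshold):
    """Return (false negatives, false positives) when flagging P(Malignant) >= threshold."""
    flagged = p_malignant >= threshold
    false_neg = int(np.sum((ytest == 0) & ~flagged))  # Malignant cases that were missed
    false_pos = int(np.sum((ytest == 1) & flagged))   # Benign cases flagged as Malignant
    return false_neg, false_pos

for t in (0.5, 0.1):
    fn, fp = count_errors(t)
    print(f"threshold={t}: false negatives={fn}, false positives={fp}")
```

Lowering the threshold can only keep or reduce the number of false negatives, usually at the cost of more false positives, so the right value depends on how the two kinds of error are weighed.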