# **Introduction to Breast Cancer Classification with scikit-learn**

In this Jupyter notebook, we will walk you through the process of performing breast cancer classification using the breast cancer Wisconsin dataset available in the scikit-learn library. This notebook is designed for beginners who want to get started with machine learning and gain a clear understanding of how to use the scikit-learn library for machine learning tasks.

## **The Breast Cancer Wisconsin Dataset**
The breast cancer Wisconsin dataset, often referred to as the "breast cancer dataset," is a widely used dataset in machine learning. It contains various features extracted from breast cancer biopsies and is used for classifying tumors as either malignant or benign based on these features. This dataset is included in the scikit-learn library, making it easily accessible for our analysis.

## **Classification with a Naive Bayes Classifier**
In this notebook, we will employ a Naive Bayes classifier for the task of breast cancer classification. The Naive Bayes method is a simple and effective classification algorithm that is particularly suitable for beginners. We will take you through the process of loading the dataset, preprocessing the data, and training a Naive Bayes classifier to classify breast tumors.

## **Basics of Using scikit-learn**
One of the main goals of this notebook is to help you grasp the fundamentals of using the scikit-learn library for machine learning. You will learn how to load and preprocess datasets, split data into training and testing sets, train machine learning models, and evaluate their performance. We will guide you through each step with explanations and code examples, ensuring that you build a solid foundation in using scikit-learn for your machine learning projects.

## **Visualizations**
Throughout this notebook, we will also incorporate visualizations to help you better understand the data and model performance. Visualization is a powerful tool in machine learning, and we will demonstrate its importance in gaining insights from your data.

## **Target Audience**
This notebook is specifically designed for beginners who are taking their first steps in machine learning. If you are looking for a simple, beginner-friendly resource to kickstart your career in machine learning, you've come to the right place. By the end of this notebook, you will have a good grasp of the essential concepts and practical skills required to build and evaluate machine learning models.

Let's dive in and start our journey to classify breast cancer tumors using scikit-learn!

# **Environment Setup**

Before we dive into the practical aspects of breast cancer classification, let's ensure that our environment is properly set up. In this section, we will guide you through the process of setting up the necessary libraries and functions that are essential for the smooth functioning of this notebook.

To get started, we need to import the following libraries:

- **scikit-learn (sklearn):** This library is at the heart of our machine learning tasks. It provides tools for data preprocessing, model training, and evaluation. We'll use scikit-learn extensively throughout this notebook.

- **NumPy:** NumPy is a fundamental library for scientific computing in Python. It enables us to work with arrays, matrices, and numerical operations efficiently.

- **Matplotlib:** Matplotlib is a powerful library for creating data visualizations. We'll use it to create various plots and graphs to better understand our data and model performance.

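- **Plotly:** Plotly is a separate plotting library (not an extension of Matplotlib) that produces interactive visualizations, such as charts you can hover over, zoom, and filter. We'll use its `plotly.express` interface for the exploratory plots in this notebook.

- **Seaborn:** Seaborn is a statistical visualization library built on top of Matplotlib. It is imported alongside the other plotting libraries.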

- **Pandas:** Pandas is a versatile library for data manipulation and analysis. We'll use it to load and manipulate the breast cancer dataset.

By ensuring that we have the right libraries in place, we set the stage for a hassle-free exploration of machine learning and breast cancer classification. Now that our environment is properly configured, let's proceed to load the breast cancer dataset and begin our journey into the world of machine learning!

In [85]:
# Data Visualization
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Naive Bayes Model
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score

# **Breast Cancer Wisconsin Dataset**

The dataset we are working with is the **Breast Cancer Wisconsin (Diagnostic) dataset**, a widely used benchmark in **machine learning**. It consists of features computed from images of **breast cancer biopsies**, with the objective of classifying each tumor as **malignant or benign**. The dataset contains **569 instances**, each described by **30 features** that measure properties of the **cell nuclei**, such as *radius, texture, smoothness, compactness, and concavity.*

These features play a crucial role in determining the nature of the tumor. For each of the **10 base measurements**, the dataset records the **mean, standard error, and worst (largest)** value, which gives **10 × 3 = 30 feature columns**. This makes the dataset a good foundation for exploring machine learning techniques on the important task of breast cancer classification.

In [2]:
# Load the dataset
dataset = load_breast_cancer()

# Extract respective information
data = dataset.data
features = dataset.feature_names
labels = dataset.target
class_names = list(dataset.target_names)

Let's look at the dataset's built-in description to get a better understanding of the data.

In [3]:
print(dataset.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

Let's focus on the attributes (features) we have at hand.

- **Radius:** This feature represents the mean distance from the center of the nucleus to points on its perimeter, providing information about the size of the nucleus.

- **Texture:** Texture is the standard deviation of gray-scale values in the nucleus, offering insights into the variation in pixel intensity.

- **Perimeter:** The perimeter attribute indicates the total length of the nucleus's boundary, reflecting the overall shape and size.

- **Area:** Area represents the total number of pixels inside the boundary of the nucleus, providing additional information about the size of the nucleus.

- **Smoothness:** Smoothness characterizes the local variation in the lengths of the radius, giving an idea of how irregular or smooth the nucleus's shape is.

- **Compactness:** Compactness is calculated as (perimeter^2 / area - 1.0), and it quantifies how compact the shape of the nucleus is. Higher values suggest a more irregular shape.

- **Concavity:** Concavity indicates the severity of concave portions in the contour of the nucleus, helping to distinguish between tumors with different structural characteristics.

- **Concave Points:** This attribute counts the number of concave portions in the contour, providing information about the number of irregularities in the nucleus's shape.

- **Symmetry:** Symmetry measures how symmetric or asymmetric the nucleus is, contributing to the understanding of its overall shape.

- **Fractal Dimension:** The fractal dimension, often referred to as the "coastline approximation," quantifies the complexity of the nucleus's perimeter. A higher fractal dimension indicates a more complex and irregular shape.

These attributes collectively form the basis for the classification of breast tumors as malignant or benign, enabling the development of machine learning models for accurate diagnosis.


---

The Breast Cancer Wisconsin dataset includes two classes for classifying tumors:

- **Malignant (M):** Tumors classified as malignant are cancerous and represent a threat to the patient's health. These tumors are typically characterized by aggressive growth and potentially invasive properties.

- **Benign (B):** Benign tumors are non-cancerous and generally pose a lower risk to the patient's health. They tend to have a less aggressive growth pattern and are usually contained within a well-defined boundary, making them less harmful.

The task in this dataset is to accurately classify tumors as either malignant (M) or benign (B) based on the provided features, enabling medical professionals to make informed decisions regarding patient diagnosis and treatment.

---

Understanding what each feature means is the basis for any thorough data analysis. Knowing the features lets us examine the data with more precision: it helps us spot correlations between features, build better-informed predictive models, and recognize outliers or anomalous values early, because we know what range of values is normal. It also makes it easier to relate the feature columns to the target column, which improves the interpretability of the results.

In [4]:
info = f"""
Class Names: \n\t{", ".join(c.title() for c in class_names)}

Feature Names: \n\t{", ".join(f.title() for f in features[:10])}
        {", ".join(f.title() for f in features[10:20])}
        {", ".join(f.title() for f in features[20:])}

Total No. of Features: {len(features)}
"""

print(info)


Class Names: 
	Malignant, Benign

Feature Names: 
	Mean Radius, Mean Texture, Mean Perimeter, Mean Area, Mean Smoothness, Mean Compactness, Mean Concavity, Mean Concave Points, Mean Symmetry, Mean Fractal Dimension
        Radius Error, Texture Error, Perimeter Error, Area Error, Smoothness Error, Compactness Error, Concavity Error, Concave Points Error, Symmetry Error, Fractal Dimension Error
        Worst Radius, Worst Texture, Worst Perimeter, Worst Area, Worst Smoothness, Worst Compactness, Worst Concavity, Worst Concave Points, Worst Symmetry, Worst Fractal Dimension

Total No. of Features: 30



The central variable that defines our **machine learning model's objective** is the **target variable**. In this dataset we are dealing with **binary classification**, as there are **two distinct classes**. The values we aim to predict are **binary**: **0 represents the malignant class, and 1 represents the benign class**.

This **binary nature** signifies that we are focused on **distinguishing between these two specific classes**. If there were **more than two classes**, it would constitute a **multi-class classification task**, which involves a **different approach and complexity in model development**. Therefore, our **target variable** is of paramount **importance**, as it guides our **entire modeling process and the nature of the task at hand.**

In [5]:
N_CLASSES = len(class_names)

print(f"Number of Classes: {N_CLASSES}")

Number of Classes: 2
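
To make the label encoding concrete, the following minimal sketch maps each integer label to its class name and counts the samples per class. It reloads the dataset so that it runs on its own; the variable name `label_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Map each integer label to its class name and count the samples per class.
values, counts = np.unique(dataset.target, return_counts=True)
label_counts = {
    str(dataset.target_names[v]): int(c) for v, c in zip(values, counts)
}

for value in values:
    name = str(dataset.target_names[value])
    print(f"{value} -> {name}: {label_counts[name]} samples")
```

The position of a name in `target_names` is the integer used in `target`, which is why index 0 corresponds to malignant and index 1 to benign.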


Let's have a look at our data points / values / samples.

In [6]:
data

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

In its current state, our dataset is not human-friendly: it is a raw NumPy array printed in scientific notation, with no column names to tell us which value belongs to which feature. Machine learning models work well with such numerical arrays, but they are hard for people to read. To make the data easier to understand, we will convert it into a Pandas DataFrame. Pandas is a library for data manipulation that lets us organize the data into labeled columns, which makes it easier to explore, visualize, and interpret.

In [7]:
df = pd.DataFrame(data, columns=features)
df['Class'] = pd.Series(labels).map({
    0: "Malignant",
    1: "Benign"
})
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | Malignant |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | Malignant |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | Malignant |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | Malignant |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | Malignant |


In [8]:
print(f"Number of data samples: {len(df)}")

Number of data samples: 569


As expected, we have 569 data samples with 30 features.
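
As an aside, scikit-learn can also hand back the dataset as a DataFrame directly. The sketch below is an alternative to the manual conversion, not the approach this notebook uses; it assumes scikit-learn 0.23 or later, where the `as_frame` parameter is available. The names `bunch` and `frame` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
# (assumes scikit-learn 0.23 or later).
bunch = load_breast_cancer(as_frame=True)
frame = bunch.frame  # 30 feature columns plus an integer "target" column

# Add a readable class column, mirroring the 0/1 mapping used in this notebook.
frame["Class"] = frame["target"].map({0: "Malignant", 1: "Benign"})

print(frame.shape)
print(frame["Class"].value_counts())
```

The manual conversion has the advantage of showing exactly how the array, the feature names, and the labels fit together, which is useful when a dataset does not come with such a convenience option.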

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

The **good news** is that this dataset needs **very little preprocessing**, as it is already in a **clean state**. Firstly, the features are in **numeric format**, making them directly compatible with **machine learning models**. Secondly, there are **no missing values**: the **non-null count** shows that all features contain **569 non-null entries**, indicating the **absence of blank or null values**. Additionally, all 30 feature columns are of **the float64 data type**; the only non-numeric column is the `Class` column that we added ourselves for readability. This eliminates the need for converting **categorical features into numeric representations**. With this clean, numeric structure, we are well-equipped to proceed with **data analysis and model development**.
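
Rather than reading the `info()` output by eye, the same two checks can be done programmatically. This is a minimal sketch that rebuilds the feature table so it runs on its own; `df_check`, `n_missing`, and `all_numeric` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df_check = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Total number of missing entries across all feature columns.
n_missing = int(df_check.isna().sum().sum())

# True only if every feature column has a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in df_check.dtypes)

print(f"Missing values: {n_missing}")
print(f"All features numeric: {all_numeric}")
```

Checks like these are worth keeping in a notebook, because they fail loudly if the data source ever changes.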

---

## **Statistical Analysis**


For those who **prioritize a statistical understanding** of the data over visual analysis, the aim is to describe the dataset through its **numerical properties**. **Machine learning models operate on the same numbers either way**, but there are **situations or user requirements** where a **statistical perspective** takes precedence over visualizations.

Usually, both approaches are used together, but **visual analysis** tends to be **more popular** because charts are **easier to interpret at a glance.** Nonetheless, there are instances where a **statistical analysis** is indispensable, and this is where Python libraries like **Pandas and NumPy** come into play. Without going into an extensive discussion of statistical analysis here, it's worth noting that **Pandas provides a method** called **`describe()`**, which reports key statistics for each numeric column: the **count, mean, standard deviation, minimum, quartiles, and maximum**.

This feature empowers data analysts to grasp the data's numerical characteristics in a structured manner, laying the foundation for a more profound understanding.

In [10]:
df.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |
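
The summary above pools both classes together. A small extension, sketched here under the assumption that a per-class view is of interest, is to group by the class first and then describe a single feature. The name `df_stats` and the choice of `mean radius` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df_stats = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df_stats["Class"] = pd.Series(dataset.target).map({0: "Malignant", 1: "Benign"})

# Summary statistics of one feature, computed separately for each class.
per_class = df_stats.groupby("Class")["mean radius"].describe()
print(per_class.round(2))
```

Comparing the two rows gives a first numerical hint of how well a feature separates the classes, before any chart is drawn.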


For this notebook, we will specifically focus on the visual analysis of the data simply because it's easy to understand the data based on visual charts.

# **Exploratory Data Analysis**

Exploratory Data Analysis is a fundamental step in the data analysis process that serves as a cornerstone for understanding, summarizing, and visualizing a dataset. This phase is essential for gaining insights into the data's structure, characteristics, and peculiarities. EDA empowers data analysts to make informed decisions about data preprocessing, model selection, and feature engineering. By examining key statistics, data distribution, and relationships between variables, EDA allows us to uncover patterns and anomalies in the data.

During EDA, we employ various statistical and graphical techniques to explore the dataset comprehensively. Common charts and plots often used include:

- **Histograms:** Histograms provide a visual representation of the data's distribution, allowing us to identify patterns, central tendencies, and data spread.

- **Box Plots:** Box plots reveal the spread and central tendencies of the data, highlighting potential outliers and skewness.

- **Scatter Plots:** Scatter plots display the relationship between two variables, helping us identify correlations, clusters, or trends.

- **Correlation Heatmaps:** These visualizations show the pairwise correlations between variables, which can guide feature selection and model building.

We need EDA when:

1. **Data Understanding:** To gain a comprehensive understanding of the dataset and its properties, including data quality, missing values, and potential outliers.

2. **Feature Selection:** To identify relevant features that are strongly associated with the target variable or demonstrate meaningful interactions.

3. **Model Selection:** To choose the appropriate machine learning model based on the data's characteristics.

4. **Anomaly Detection:** To detect unusual patterns or outliers that may impact model performance or require special treatment.

To conduct EDA, one should:

1. **Load and Inspect Data:** Begin by loading the dataset and inspecting its structure, dimensions, and data types.

2. **Handle Missing Data:** Address missing values, if any, by imputation or removal, ensuring data completeness.

3. **Summary Statistics:** Calculate summary statistics such as means, medians, and percentiles to understand the data's central tendencies.

4. **Visualization:** Create visualizations like histograms, box plots, scatter plots, and correlation heatmaps to visualize data distributions, relationships, and anomalies.

5. **Hypothesis Testing:** Explore statistical tests to validate hypotheses and assess the significance of findings.

Exploratory Data Analysis equips data analysts with the tools to uncover insights, identify data peculiarities, and make informed decisions throughout the data analysis and machine learning journey.

---

## **Method for EDA**

In the **realm of exploratory data analysis (EDA),** there is room for **flexibility** in the approach chosen, but it is beneficial to follow a **structured path to gain deeper insights** into the dataset. My **preferred direction** begins with an **in-depth examination** of individual features, understanding their characteristics, and employing specific methods such as **histograms, density charts, and summary statistics.**

This **initial step provides** a **solid foundation** for grasping the **univariate distribution** and behavior of **each variable**. Subsequently, I shift my focus towards **analyzing the correlations** between these features, utilizing techniques like **Pearson's correlation coefficient**, to unveil the **relationships and dependencies** within the data. Finally, I employ **various visualizations**, including **scatter plots, line charts, bar charts, and column charts**, to visually represent these correlations.

This systematic approach allows for a **comprehensive exploration** of the data from individual feature attributes to their interrelationships. While **alternative approaches** may delve directly into **correlation analysis**, beginning with an understanding of individual features ensures a **more holistic comprehension** of the dataset.
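
The correlation step of this approach can be sketched in a few lines. The example below computes Pearson's correlation coefficient for every pair of features and lists the most strongly related pairs. It is an illustration of the idea rather than part of the original workflow; `df_corr`, `pairs`, and `top_pairs` are names chosen here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df_corr = pd.DataFrame(dataset.data, columns=dataset.feature_names)

# Pairwise Pearson correlation between all 30 features.
corr = df_corr.corr(method="pearson")

# Take the upper triangle only, so that each pair of features appears once.
cols = corr.columns
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i, j],
    index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
)

# Rank the pairs by the absolute value of their correlation.
top_pairs = pairs.sort_values(key=np.abs, ascending=False).head(5)
print(top_pairs)
```

Pairs with a correlation close to 1 carry largely the same information, which is worth knowing before interpreting their individual plots.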

## **EDA Functions**

To streamline the process of generating multiple similar plots, such as **histograms** for **different features**, it is efficient to employ **functions** that can be reused with **minimal code modification**. Since our main **data frame remains relatively consistent** and processed, creating these plotting functions will **significantly enhance our workflow**. This approach saves us from **duplicating code and simplifies** the task of producing **various visualizations for different features**.

In [11]:
def histogram(feature, data_frame = df, color = False):
    """Plot a histogram of `feature`; split the bars by class when `color` is True."""
    fig = px.histogram(
        data_frame,
        x = feature,  # column name, looked up in data_frame instead of the global df
        title = f"Histogram of {feature.title()}",
        text_auto=True,
        color = "Class" if color else None
    )
    fig.update_layout(
        xaxis_title = feature.title(),
        yaxis_title = "Frequency"
    )
    fig.show()

def boxplot(feature, data_frame = df):
    """Plot one box per class for `feature`."""
    fig = px.box(
        data_frame,
        x = feature,  # column name, looked up in data_frame instead of the global df
        title = f"Box Plot of {feature.title()}",
        color = "Class"
    )
    fig.update_layout(
        xaxis_title = feature.title(),
        yaxis_title = "Class"
    )
    fig.show()

def violinplot(feature, data_frame = df):
    """Plot one violin per class for `feature`."""
    fig = px.violin(
        data_frame,
        x = feature,  # column name, looked up in data_frame instead of the global df
        title = f"Violin Plot of {feature.title()}",
        color = "Class"
    )
    fig.update_layout(
        xaxis_title = feature.title(),
        yaxis_title = "Class"
    )
    fig.show()

def pieplot(feature = "Class", data_frame = df, title = "Class Distribution"):
    """Plot a donut chart of the value counts of `feature`."""

    # Count the values in the frame that was passed in, not in the global df.
    class_dis = data_frame[feature].value_counts()

    fig = px.pie(
        data_frame,
        values = class_dis.values,
        names = class_dis.index,
        title = title,
        color = class_dis.index,
        hole = 0.4
    )
    # A pie chart has no x-axis, so no axis titles are set here.
    fig.show()

def barplot(feature = "Class", data_frame = df, title = "Class Distribution"):
    """Plot a horizontal bar chart of the value counts of `feature`."""

    # Count the values in the frame that was passed in, not in the global df.
    class_dis = data_frame[feature].value_counts()

    fig = px.bar(
        data_frame,
        x = class_dis.values,
        y = class_dis.index,
        title = title,
        color = class_dis.index,
        text_auto=True
    )
    fig.update_layout(
        xaxis_title = "Value Counts",
        yaxis_title = "Classes"
    )
    fig.show()

def class_distribution_plot():
    barplot()
    pieplot()

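def scatterPlot(feature1, feature2, color = "Class", line = False, data_frame = df):
    """Scatter plot of two features, optionally with a fitted trend line."""
    fig = px.scatter(
        data_frame,
        x=feature1,
        y = feature2,
        color = color,
        # The "ols" trend line requires the statsmodels package to be installed.
        trendline = "ols" if line else None
        )

    fig.update_layout(
        xaxis_title = feature1.title(),
        yaxis_title = feature2.title(),
    )

    fig.show()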

## **Visualization**

Let's begin our data visualization with a pivotal graph: the **class distribution chart**. This graph is **important** because it shows whether our **model is at risk of becoming biased towards a specific class**. When one class has a **significantly higher number of samples than the other**, the distribution is **uneven**, which often results in the **model giving more weight to the class with higher representation**, i.e., **higher frequency**.

This can lead the model to be **more inclined** towards the **majority class**, potentially resulting in **biased predictions**. For instance, if one class **constitutes 60% of the data**, a model that **always predicts that class** would be **correct about 60% of the time** without learning anything from the features. Such an **imbalance** therefore has to be kept in mind when judging accuracy.

Therefore, **addressing class imbalance** becomes a **crucial step** in ensuring that **our model makes fair and accurate predictions**, regardless of the **class distribution.**
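
To make the baseline argument concrete, the sketch below measures how accurate a "model" is that ignores the features entirely and always predicts the most frequent class. It uses scikit-learn's `DummyClassifier` as an illustration; this is not one of the models trained in this notebook, and `baseline` and `baseline_accuracy` are names chosen here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

X, y = load_breast_cancer(return_X_y=True)

# A "model" that ignores the features and always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Accuracy of this baseline equals the share of the majority class in the data.
baseline_accuracy = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any real classifier should be compared against this number: an accuracy only slightly above it means the model has learned little beyond the class frequencies.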

In [12]:
class_distribution_plot()

Observing the **bar and pie plots** for **class distribution**, it becomes evident that a **moderate class imbalance exists** within our dataset.

* Approximately **63% of the data belongs to the benign class** (357 samples), while the remaining **37% belongs to the malignant class** (212 samples). With this roughly **60% to 40%** split, a model that **always predicts benign** would already reach an **accuracy of about 63%.** Any accuracy we report later should be judged against this baseline, since a score close to it would mean that **our model has not grasped the underlying patterns** in the data.

* One **common and straightforward precaution** is to employ **stratified shuffle splitting**, which ensures that the **class distribution remains consistent across training, testing, and validation datasets**. Stratification does not remove the imbalance; it makes sure that every subset reflects it equally.

It's important to note that **class imbalance isn't necessarily a problem**, as it **may accurately reflect the distribution of classes in the real world**. In some cases, the **higher prevalence** of **one class** in the **data aligns** with **real-world occurrences**, and **artificially balancing classes may not be desirable**.
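
A stratified split can be sketched with `train_test_split`, which is already imported in this notebook. The `test_size` and `random_state` values below are illustrative choices, not settings prescribed by the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the benign/malignant proportions nearly identical in both subsets.
# test_size and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare the share of benign samples (label 1) in each subset.
for name, part in [("full", y), ("train", y_train), ("test", y_test)]:
    benign_share = np.mean(part == 1)
    print(f"{name:>5}: {len(part)} samples, {benign_share:.1%} benign")
```

Without `stratify`, a random split could by chance put noticeably more malignant cases in one subset than in the other, which would make the test score harder to interpret.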

### **Mean Radius**

In [13]:
histogram(features[0], color=True)

* Upon examining the histogram colored by the **target class**, a distinct difference between the two distributions **becomes evident**. What matters most is the **value range in which the data points of each class lie**. At **higher values** of the **mean radius**, the **malignant class dominates**, while at **lower values** of the **mean radius**, a **prominent peak** is formed by the **benign class**.

* This observation suggests that tumors with **lower mean radius values** are **more likely** to be **classified as benign**. While this observation is **not** an **absolute rule**, it is noteworthy that a **concentration of benign cases** is found in this **specific range**. It's worth mentioning that for extremely **low mean radius values**, the likelihood of the **tumor being benign is notably high,** whereas for **extremely high values**, **malignancy is predominant**.

* The critical aspect here is the **concentration within the range of approximately 10 to 15**, where the **peak is predominantly formed by the benign class**, while the **distribution across a wider range of mean radius values is attributed to the malignant class**.

---

Note: The height of the benign peak is very likely amplified by the class imbalance, since the benign class simply has more samples.
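
One way to take the class imbalance out of the comparison is to look at proportions within each class instead of raw counts. The sketch below bins the mean radius and normalizes each class separately. The bin edges are an arbitrary choice made for illustration, and `df_norm`, `binned`, and `shares` are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df_norm = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df_norm["Class"] = pd.Series(dataset.target).map({0: "Malignant", 1: "Benign"})

# Bin the mean radius into shared intervals (the bin width is an arbitrary choice).
bins = np.arange(5, 31, 5)  # edges 5, 10, ..., 30
binned = pd.cut(df_norm["mean radius"], bins=bins)

# normalize="index" turns each class row into proportions that sum to 1,
# so the two classes can be compared regardless of their sample counts.
shares = pd.crosstab(df_norm["Class"], binned, normalize="index")
print(shares.round(3))
```

If the benign class still dominates the lower intervals after normalization, the pattern reflects the data itself and not just the larger number of benign samples.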

In [14]:
boxplot(features[0])
violinplot(features[0])

The **box plot and violin plot** offer a clearer view of the **data's distribution** and of the **differences between the classes** than the **histogram**.

* Particularly, the **violin plot** vividly illustrates how the **malignant class** is distributed across a **wide range of values**, and it reveals the presence of **statistical outliers**.

* The **box plot**, on the other hand, provides a **clear summary of each distribution** and **shows that the medians (the center lines of the boxes) of the two classes are considerably distant from each other**. This separation is **advantageous**, as it helps **our model** to **tell the classes apart**.

It's worth noting the **substantial overall range** of the **malignant class**, which is **spread across a wide spectrum**. While this **comprehensive range poses a challenge** in **precisely detecting tumors solely based on this feature**, it provides valuable insights into the **complexity and diversity of the data**.
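
The quantities a box plot draws can also be computed directly. The following sketch reports the quartiles and the number of points beyond the common 1.5 × IQR whisker rule for each class. The helper `box_summary` is hypothetical and written for this illustration, and Plotly's own quartile calculation may differ slightly from the Pandas default used here.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
df_box = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df_box["Class"] = pd.Series(dataset.target).map({0: "Malignant", 1: "Benign"})


def box_summary(values: pd.Series) -> pd.Series:
    """Quartiles and the count of points beyond the 1.5 * IQR whiskers."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return pd.Series(
        {"q1": q1, "median": median, "q3": q3, "n_outliers": n_outliers}
    )


# One row per class, one column per statistic.
summary = pd.DataFrame(
    {name: box_summary(group) for name, group in df_box.groupby("Class")["mean radius"]}
).T
print(summary.round(2))
```

The same helper can be reused for any other feature by changing the column name in the `groupby` selection.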

---

### **Mean Texture**

In [15]:
histogram(features[1])
histogram(features[1], color=True)

The histogram for the **mean texture closely mirrors** the **histogram of the mean radius**, with both **exhibiting prominent peaks in similar regions and for approximately the same value ranges**.

* Even when splitting the distribution by the **target class**, similarities emerge, with **one significant distinction**. This time, the **benign class values** are **distributed across most of the range,** and similarly, the **malignant class values also exhibit a wide distribution**.

* Unlike the mean radius, where the **distribution was more concentrated**, the **mean texture values** are spread over a **broad range**.

This observation underscores that **classifying tumors solely** based on **mean texture** is **not straightforward.** Notably, isolated **bins with a count of one** point to possible **outliers**, that is, **extreme or uncommon values within the dataset**.

In [16]:
boxplot(features[1])
violinplot(features[1])

Whereas the **histograms** of the two features looked alike, the **box plot and violin plot** reveal a **clear difference** between **mean radius and mean texture**. While the **mean radius** showed **two fairly well-separated groups**, the **mean texture** presents a **different picture**.

In the case of **mean texture**, the **two groups** are **not distinctly separate**; rather, they overlap. Even though there's a trend where **one class starts to emerge as the other class begins to decline**, the **overlap between them is notable**. This overlapping nature between the **two classes** in the **mean texture feature demonstrates** that they are **closely intertwined** and **less distinct compared** to the **mean radius feature**.

Another **notable observation** is the **higher count of outliers**. **Outliers** are data points that **deviate markedly from the rest of the distribution, and here they are present in both classes.**
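Box plots conventionally flag a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule with NumPy; the helper `iqr_outliers` and the toy `sample` are hypothetical, and plotting libraries may compute quartiles slightly differently.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy data with one obviously extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(sample)
print(outliers)
```

The same function could be applied to each class of a feature column to count the outliers seen in the plots.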

---

### **Mean Perimeter**

In [17]:
histogram(features[2])
histogram(features[2], color=True)

Once more, the **distribution closely resembles** that of the **mean radius**. This similarity could be **attributed to the interdependence** of **these properties**, such as the **radius and the perimeter**.

In [18]:
boxplot(features[2])
violinplot(features[2])

As anticipated, the **perimeter and radius** are indeed **interdependent**: for a roughly circular nucleus, the **perimeter grows in proportion to the radius**. The resemblance between the **two distributions** is **evident**, particularly when **examining the violin plot**.

---

### **Mean Area**

In [19]:
histogram(features[3])
histogram(features[3], color=True)

Observing the **area distribution**, we note that while the **values reside** in a **different range**, some **evident distinctions separate this distribution from the previous ones**.

However, a **general similarity persists**, with the **benign class** displaying a **substantial distribution** and a **peak in the initial value ranges**. In contrast, the **malignant class** exhibits a **wider distribution** and showcases **outliers toward the extreme end**, akin to the **patterns observed in the prior features.**

----

This approach, delving into **each feature individually**, is a **standard practice**, as it **unveils unexpected insights**. However, there's a **quicker method** at our **disposal**. Rather than plotting and **inspecting each variable one by one**, we can **efficiently examine the correlation plot**.

**Correlations inherently capture relationships between variables**, simplifying the **process of identifying connections between the target variable and other features**, as well as unveiling **correlations among the features themselves**. These **inter-feature correlations** offer a **comprehensive view** of the **data's relationships**.

### **Correlation Plot**

The **Pearson correlation**, often referred to as **Pearson's correlation coefficient** or **Pearson's r**, is a **statistical measure** that quantifies the **linear relationship** between **two variables**.

It provides a **numerical value** between **-1 and 1**, with **1 signifying a perfect positive correlation, -1 indicating a perfect negative correlation, and 0 suggesting no linear correlation**.

**Pearson correlation** is widely used in **data analysis** to assess how changes in **one variable relate to changes in another**, making it a **valuable tool** for **understanding data relationships** and **patterns**.
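To make the definition concrete, Pearson's r can be computed by hand as the covariance of the two variables divided by the product of their standard deviations. This is a minimal sketch; `pearson_r` is a hypothetical helper written for illustration, and its result is compared against NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx = x - mean(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = np.arange(1, 11)
r_pos = pearson_r(x, 2 * x + 3)          # perfectly increasing line
r_neg = pearson_r(x, -0.5 * x)           # perfectly decreasing line
r_lib = np.corrcoef(x, 2 * x + 3)[0, 1]  # NumPy's built-in result for comparison
print(r_pos, r_neg, r_lib)
```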

In [20]:
df_copy = df.drop(columns=["Class"])
pearson_corr = np.round(df_copy.corr(method="pearson"), 2)
corr = px.imshow(
    pearson_corr,
    text_auto=True,
    width = 1400,
    height = 1400
    )
corr.show()

The **Spearman correlation**, named after **Charles Spearman**, is a **non-parametric statistical measure** used to evaluate the **strength and direction of monotonic relationships between two variables**. Unlike the **Pearson correlation**, **Spearman's correlation** does **not assume a linear relationship** and is more **robust to outliers**.

It quantifies the **similarity of the rank orders** of **data points** rather than their **specific values**. This makes it suitable for **assessing associations** in data when the **variables may not follow a linear pattern.** **Spearman correlation** values range from **-1 (perfect inverse monotonic relationship) to 1 (perfect monotonic relationship), with 0 indicating no monotonic association between the variables.**
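Because Spearman's coefficient works on ranks, it equals the Pearson correlation of the ranked data. The sketch below illustrates this on synthetic data (an exponential curve chosen purely as an example of a monotonic, non-linear relationship).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": np.linspace(1, 10, 50)})
toy["y"] = np.exp(toy["x"])  # monotonic, but far from linear

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

# Spearman is Pearson applied to the ranks of the values
rank_pearson = toy.rank().corr(method="pearson").loc["x", "y"]
print(round(pearson, 3), round(spearman, 3), round(rank_pearson, 3))
```

Here the Spearman coefficient reaches 1 because `y` always increases with `x`, while the Pearson coefficient stays noticeably lower because the increase is not linear.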

In [21]:
df_copy = df.drop(columns=["Class"])
spearman_corr = np.round(df_copy.corr(method="spearman"), 2)
corr = px.imshow(
    spearman_corr,
    text_auto=True,
    width = 1400,
    height = 1400
    )
corr.show()

The **sheer magnitude** of this **correlation matrix** is **indeed substantial**, largely owing to the **exceptionally high number of features**, a **total of 30**.

**Thirty numerical features** is a **fairly large number to inspect by eye** in a single matrix. To **streamline our analysis**, we'll narrow our focus to the **most prominent** and **robust relationships**.

Given the **multitude of features**, we will **concentrate on specific**, **strong correlations** that are **evident** in the **Pearson correlation matrix**. Specifically, we will **investigate correlations with coefficients exceeding 0.9**, as these relationships hold particular **significance and warrant our attention.**
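Instead of scanning the heatmap by eye, the strongly correlated pairs can be extracted programmatically. This is a sketch under the assumption that the features are loaded with `load_breast_cancer(as_frame=True)`; the names `features_df` and `strong_pairs` are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

features_df = load_breast_cancer(as_frame=True).data
corr_matrix = features_df.corr(method="pearson")

# Keep only the upper triangle (above the diagonal) so that each pair
# appears once and self-correlations are excluded.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
pairs = corr_matrix.where(mask).stack()

# Pairs whose absolute coefficient exceeds the 0.9 threshold
strong_pairs = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(strong_pairs)
```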

In [22]:
scatterPlot("mean radius", "mean perimeter")

The presence of a **clear linear relationship** is **self-evident**, requiring **no extensive explanation**. This **linear correlation** is rooted in a **straightforward principle**: the **perimeter is directly dependent on the radius**. As the **radius increases,** so does the **perimeter value**, leading to this **linear association between the two variables**.

In [23]:
scatterPlot("mean radius", "mean area")

The connection between **area and radius** looks **roughly linear**, although it's **not strictly linear**. This is where **rank-based measures such as the Spearman correlation** are useful: they detect whether **one variable consistently increases as the other increases**, regardless of the exact shape of the curve. In this case, an increase in **radius corresponds to an observable increase in area**, but the relationship is **closer to quadratic.** This **non-linearity** stems from the **fact that the area of a roughly circular shape grows with the square of the radius**, resulting in this **distinctive correlation pattern**.
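The power-law nature of the radius and area relationship can be checked with a log-log fit: if area is proportional to radius raised to some exponent, the slope of log(area) against log(radius) estimates that exponent. The sketch below uses an idealized circle (area = π·r²) as synthetic data; real nuclei are only approximately circular, so the fitted exponent on the actual dataset may deviate from 2.

```python
import numpy as np

radius = np.linspace(5.0, 30.0, 100)
area = np.pi * radius ** 2  # idealized circle, not the dataset values

# Slope of the log-log line estimates the exponent of the power law
exponent, intercept = np.polyfit(np.log(radius), np.log(area), deg=1)

# Pearson correlation is high but below 1 because the curve is not a straight line
pearson = np.corrcoef(radius, area)[0, 1]
print(round(exponent, 3), round(pearson, 3))
```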

In [24]:
scatterPlot("mean radius", "worst radius", line = True)

The relationship between the **worst radius and the mean radius** exhibits a distinct **linearity**, which is **readily apparent**. Employing a trend line **further elucidates this linear connection**.

One plausible reason behind the **linear relationship** between the **worst radius** and the **mean radius** lies in the way the features are calculated. In this dataset, the **worst radius** is the **mean of the three largest radius values** measured for an image, while the **mean radius** is an **average value derived from all radius measurements**.

As a result, variations in the **worst radius** are **largely reflective of the spread of individual radius measurements within the dataset**. When these **individual measurements tend to be higher or lower, they influence both the worst and mean radius**, thus fostering a **linear relationship between the two features.**

In [25]:
scatterPlot("mean radius", "worst area")

The rationale behind this seemingly linear yet fundamentally quadratic relationship is straightforward. It stems from the geometry linking area and radius: area grows with the square of the radius, and the worst area tracks the overall size captured by the mean radius.

In [26]:
scatterPlot("mean texture", "worst texture")

Outside the size-related features, the **texture pair** stands out as the **only relationship** with a **notably high coefficient**. Identifying such **relationships is important**, as they play a **role in feature selection**.

The difficulty is that this correlation tells us little beyond the texture feature itself. **Worst texture and mean texture are two summaries of the same underlying measurement, so they are directly related and lack independence.**

Consequently, they **exhibit this linear relationship**. However, with respect to **other features**, as well as the **target variable**, a comparable **connection is not evident**. Therefore, the **removal of one of these two features would likely have minimal impact on the overall analysis.**

In [27]:
scatterPlot('mean compactness', 'mean concavity', line = True)

A **noticeable linear relationship** emerges between the **mean concavity and the mean compactness**. When examined in **conjunction with the class distribution**, it becomes evident that although they exhibit a **linear nature**, there are **nuances in compactness**. This divergence is attributed to the **slopes of the lines generated** with respect to each class. While the **relationship is linear**, there is a **discernible distinction** in terms of the **linear association**. Furthermore, higher values of **compactness tend to be associated with the malignant class**, whereas **lower values align more with the benign class**.

The existence of this relationship can be attributed to the underlying characteristics of the features. **Mean concavity and mean compactness** both describe aspects of the **shape of the cell nuclei' contours within breast cancer biopsies**. Their **linear relationship likely stems from the fact that an irregular contour affects both features at once**: pronounced concave regions also make the outline less compact. The differences observed between the classes can be attributed to variations in **cell structure and behavior**, which are indicative of the **benign and malignant classes.**

In [28]:
scatterPlot('mean concave points', 'mean concavity', line = True)

A **robust linear relationship** is observed between the **mean concave points and the mean concavity**, and notably, this **linear relationship** holds **consistently across both classes**. The noteworthy aspect is that these **relationships exhibit roughly the same slopes**, even though their **concentrations differ**. This **uniformity in slopes implies** that it is feasible to **readily associate lower and higher values with their respective classes**, as this **linear association is consistently maintained.**

----

A **careful examination** of the **correlation plot** reveals that the **majority of the correlations** are **weak to moderate**. Although some correlations stand out with **coefficients exceeding 0.90, even reaching 0.99** in **positive strength**, it is essential to understand that these relationships primarily **stem from the interdependence of features** and do not, by themselves, tell us anything new about the **target column**.

This **interdependence** becomes evident when **examining features** such as **perimeter, area, and radius**, which are **most commonly associated** with **one another**. Consequently, throughout the **Spearman correlation analysis**, these **core relationships emerge** as the **most pronounced**, with the **highest and most robust coefficients**.

---

In [52]:
df_target = df.copy()
df_target.Class = df_target.Class.map(lambda x: class_names.index(x.lower()))
target_column_corr = np.round(df_target.corr(method = "spearman"), 2).Class[:-1]

fig = px.bar(
    x=target_column_corr.index,
    y=target_column_corr.values,
    color=target_column_corr.index,
    text_auto = True
    )

fig.update_layout(
    xaxis_title = "Features",
    yaxis_title = "Coefficient",
    showlegend = False
)

fig.show()

* Upon examining the **correlation coefficients** of the features in relation to our **target feature**, it becomes apparent that the **majority of these correlations are negative**. The most **substantial negative correlation**, at **-0.8**, is linked to the **worst perimeter feature.** Note that the sign depends on how the classes are encoded: assuming `class_names` follows scikit-learn's order (malignant = 0, benign = 1), a negative coefficient means that **larger feature values are associated with the malignant class**.

* There are **additional strong negative coefficients**, with magnitudes **above 0.75 or closely approaching 0.8**. Notable among these features are **mean concave points, worst area, worst radius, and worst concave points**. It is important to recognize that these **features largely occupy the top positions due to their interdependence**; worst perimeter is linked to worst radius, which is further associated with area, contributing to their prominent **negative correlation coefficients.**

* Excluding these size-related features, **concavity and compactness** stand out with the **highest coefficients**. This illustrates the intricacy of **feature selection,** as these **features also show strong negative correlations** with the **target variable** while carrying different information than the size-related group.

Nonetheless, **no single feature** is **perfectly correlated** with the **target column**; even the strongest coefficient is about -0.8, underscoring that **several features need to be combined** for reliable classification.
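The sign of a feature-target correlation depends on which class is encoded as 1. The following sketch illustrates this with scikit-learn's own encoding (0 = malignant, 1 = benign), which may or may not match the notebook's `class_names` order; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

frame = load_breast_cancer(as_frame=True).frame  # "target": 0 = malignant, 1 = benign

# Spearman correlation of every feature with the target (benign = 1)
rho = frame.corr(method="spearman")["target"].drop("target")

# Same computation with the encoding flipped (malignant = 1)
flipped = frame.assign(target=1 - frame["target"])
rho_flipped = flipped.corr(method="spearman")["target"].drop("target")

print(rho.sort_values().head())
```

Flipping the encoding negates every coefficient while leaving the magnitudes unchanged, so the ranking of features by absolute correlation is the same either way.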

# **Data Splitting**

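While our data is **already clean and doesn't require extensive preprocessing steps**, the **class imbalance still remains**. A stratified split does not remove the imbalance, but it ensures that the **training and test sets keep the same class proportions as the full dataset**. We therefore use `train_test_split` with the `stratify` argument, which performs a **stratified shuffle split**.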

In [60]:
X_train, X_test, y_train, y_test = train_test_split(
    data, labels,
    shuffle = True,
    random_state = 42,
    stratify = labels,
    train_size = 0.85,
    test_size = 0.15
)
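The effect of stratification can be verified by comparing class proportions before and after the split. This is a sketch that assumes `data` and `labels` correspond to the arrays returned by `load_breast_cancer(return_X_y=True)`; the variable names below are chosen to avoid clashing with the notebook's own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    shuffle=True,
    random_state=42,
    stratify=y,
    train_size=0.85,
    test_size=0.15,
)

# Fraction of each class in the full data, the training set, and the test set
overall = np.bincount(y) / len(y)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(overall.round(3), train_share.round(3), test_share.round(3))
```

The three sets of proportions should be nearly identical, which is what `stratify` is meant to guarantee.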

# **Naive Bayes (NB)**

The **Naive Bayes (NB)** method is a **simple probabilistic classification technique** used in **machine learning** and **data analysis**. It is based on **Bayes' theorem** and is a popular choice for **text classification tasks**, such as **spam email detection** or **sentiment analysis**. The "naive" in **Naive Bayes** refers to the **assumption that all features are conditionally independent of each other given the class**, which simplifies the **mathematical calculations**.

> Note that this assumption does **not strictly hold** in our dataset: as the correlation analysis showed, features such as **radius, perimeter, and area are strongly correlated**. Naive Bayes can still work well in practice despite this.

The **key idea behind Naive Bayes** is to **calculate the probability** of an **instance belonging to a particular class** given its **feature values**. It does this by first estimating the **probability distributions** of the **features** for **each class** in the **training data**. These probability distributions are typically assumed to be **Gaussian (for continuous data) or multinomial (for count data such as word frequencies).**

To make a prediction, **Naive Bayes calculates** the **probability** of an **instance** belonging to each **class** and assigns it to the class with the **highest probability**. This decision is made by **comparing the posterior probabilities for each class**.
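The prediction rule described above can be sketched from scratch with NumPy. This is an illustrative implementation on synthetic data, not scikit-learn's `GaussianNB` code: the function names are hypothetical, and the small constant added to the variances is a simple stand-in for the variance smoothing that `GaussianNB` applies.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    classes, priors, means, variances = params
    diff = X[:, None, :] - means[None, :, :]            # shape (n, classes, features)
    # Sum of per-feature Gaussian log densities: the "naive" independence step
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik
    return classes[np.argmax(log_post, axis=1)]

# Two well-separated synthetic clusters
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(10.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X_toy, y_toy)
preds = predict_gaussian_nb(params, X_toy)
print((preds == y_toy).mean())
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.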

Despite its **simplistic assumption** of **feature independence**, **Naive Bayes** often performs surprisingly well in many **real-world classification** tasks. It is **computationally efficient**, works well with **high-dimensional data**, and requires **relatively little training data**. **Naive Bayes** is a valuable tool in the **machine learning toolbox** and can be a **good choice** for tasks where its **assumptions align with the data**.

In [81]:
# Initialize Model
gnb = GaussianNB()

# Fit the data
model = gnb.fit(X_train, y_train)

# Test Predictions
model_preds = model.predict(X_test)

In [82]:
fig = px.imshow(
    confusion_matrix(y_test, model_preds),
    text_auto = True,
    title="Confusion Matrix"
)
fig.show()

The **simple model** performs **remarkably well**. Upon examining its **confusion matrix**, we observe that it made errors in **only a very small number of test samples**, specifically, **misclassifying just five instances**. In the vast majority of cases, the **model correctly predicted samples as either true positives or true negatives, underscoring its effectiveness**.
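For a binary problem, the four cells of scikit-learn's confusion matrix can be unpacked into true negatives, false positives, false negatives, and true positives. The sketch below uses a small made-up set of labels purely to show the layout; it is not the notebook's test data.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for eight samples
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes, labels sorted as [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

The off-diagonal cells (`fp` and `fn`) are the misclassifications; their sum is the total number of errors.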

In [87]:
# Mathematical Scores
pscore = precision_score(y_test, model_preds)
rscore = recall_score(y_test, model_preds)
f1score = f1_score(y_test, model_preds)
acc = accuracy_score(y_test, model_preds)

fmt = f"""
Precision Score: {pscore}
Recall Score   : {rscore}
F1 Score       : {f1score}
Accuracy Score : {acc}
"""
print(fmt)


Precision Score: 0.9622641509433962
Recall Score   : 0.9444444444444444
F1 Score       : 0.9532710280373832
Accuracy Score : 0.9418604651162791



The model performs well on the held-out test set: precision is about 0.96, recall about 0.94, F1 about 0.95, and accuracy about 0.94.

These scores indicate that the model has a **high degree of accuracy**, **precision, recall, and F1 score**, all of which are **essential metrics** for evaluating a **classification model's performance**. This underscores the **effectiveness** of the **model in accurately predicting and classifying instances**.
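Each of these scores follows from four counts of the confusion matrix. The counts in the sketch below are inferred from the printed scores (they are the counts consistent with all four values and with five misclassifications), not read from the notebook's confusion matrix, so treat them as an assumption. "Positive" here means the class labeled 1.

```python
# Counts inferred from the reported scores (assumption, see above)
tp, fp, fn, tn = 51, 2, 3, 30

precision = tp / (tp + fp)                      # of predicted positives, how many are correct
recall = tp / (tp + fn)                         # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)      # overall fraction of correct predictions

print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1       : {f1:.4f}")
print(f"Accuracy : {accuracy:.4f}")
```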

# **Summary**

In this notebook, we embarked on a **journey to explore breast cancer classification** using a simple yet powerful **machine learning model**. We initiated our analysis by setting up the **necessary environment and libraries**, loading the **breast cancer dataset** from **scikit-learn's repository**. As we delved into the data, we discovered a **comprehensive dataset with 30 features** that capture **various characteristics** of cell nuclei, making it a suitable candidate for **machine learning classification**.

After **initial exploration and visualization**, we identified a **class imbalance issue**, which could potentially **impact the model's performance**. To account for this, we used a **stratified shuffle split** (via `train_test_split` with `stratify`) so that the **training and testing datasets keep the same class proportions as the full dataset**. This step makes the **evaluation more representative of both classes**.

Our model of choice was the **Gaussian Naive Bayes classifier**, a simple probabilistic model that has proven to be effective in many **real-world scenarios**. The model demonstrated **strong performance**, with only **five misclassifications** on the test set. The evaluation metrics, **precision, recall, F1 score, and accuracy**, all fell between roughly **0.94 and 0.96**.

In summary, our journey encompassed **data exploration**, **feature analysis**, **model training**, and **evaluation**, **culminating in a highly accurate breast cancer classification model** that showcased the **potential for reliable diagnosis** based on **cell nuclei characteristics**. This project underscores the **significance of machine learning** in **medical research** and provides a **promising foundation** for **further research** and **applications in the field of healthcare**.