Question 1: What is Information Gain, and how is it used in Decision Trees?

Answer: Information Gain (IG) is a concept from information theory that measures how much the uncertainty (impurity) in a dataset is reduced after splitting the data on a particular feature. In a decision tree, it tells us which feature best separates the data into distinct classes. For a dataset $S$ split on feature $A$ into subsets $S_v$, $IG(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} Entropy(S_v)$.




*   Information Gain measures reduction in entropy

*   Used to choose the best split in Decision Trees

**How Information Gain is used in Decision Trees**  


*   **Measures reduction in entropy:** Information Gain quantifies how much the uncertainty (or impurity) in a dataset is reduced after a split. A higher Information Gain means a better split.
*   **Feature selection:** When building a Decision Tree, at each step, the algorithm evaluates all available features to see which one provides the largest Information Gain if used to split the data. The feature that yields the highest Information Gain is chosen for that particular split.
*   **Goal:** The ultimate goal is to create homogeneous child nodes, meaning each child node contains data points predominantly belonging to a single class. Information Gain helps in achieving this homogeneity efficiently by selecting the most informative features.
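
The selection step above can be sketched in a few lines of NumPy. This is a minimal illustration rather than how any particular library implements it; the helper names `entropy` and `information_gain` and the toy labels are made up for the example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the child nodes
    obtained by splitting on each distinct value of a categorical feature."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted_child_entropy = 0.0
    for value in np.unique(feature_values):
        mask = feature_values == value
        # mask.mean() is the fraction of samples that fall into this child node
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Toy data (hypothetical): 4 positives and 4 negatives.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect_feature = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
useless_feature = np.array(["a", "b", "a", "b", "a", "b", "a", "b"])

ig_perfect = information_gain(y, perfect_feature)
ig_useless = information_gain(y, useless_feature)
print(f"IG(perfect feature) = {ig_perfect:.3f}")
print(f"IG(useless feature) = {ig_useless:.3f}")
```

The first feature separates the classes completely and recovers the full 1 bit of parent entropy, while the second leaves both children as mixed as the parent and gains nothing, so a tree would split on the first.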



Question 2: What is the difference between Gini Impurity and Entropy?

Both Gini Impurity and Entropy are measures of impurity or disorder used in the construction of decision trees to determine the optimal split at each node. The goal is to minimize impurity after a split.

**Entropy**

*   **Definition:** Entropy is a concept from information theory that measures the average level of uncertainty or surprise inherent in a variable's possible outcomes. In the context of decision trees, it quantifies the impurity of a dataset.
*   **Formula:** $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$, where $S$ is the dataset, $c$ is the number of classes, and $p_i$ is the proportion of observations belonging to class $i$.
*   **Interpretation:**
    *   An Entropy of 0 means the node is perfectly pure (all samples belong to the same class).
    *   An Entropy of 1 (for binary classification) means the node is maximally impure (samples are equally divided between the two classes). With $c$ classes the maximum is $\log_2(c)$.
*   **Strengths:**
    *   Penalizes mixed nodes somewhat more heavily than Gini Impurity, although in practice the two criteria usually produce very similar trees.
    *   Has a direct information-theoretic interpretation: the average number of bits needed to encode the class label of a sample in the node.
*   **Weaknesses:**
    *   Involves logarithmic calculations, which can be computationally more expensive than Gini Impurity.
*   **Use Cases:** Often used in algorithms like ID3 and C4.5.

**Gini Impurity**

*   **Definition:** Gini Impurity measures the probability of incorrectly classifying a randomly chosen element in the dataset if that element were randomly labeled according to the distribution of labels in the dataset.
*   **Formula:** $Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$, where $S$ is the dataset, $c$ is the number of classes, and $p_i$ is the proportion of observations belonging to class $i$.
*   **Interpretation:**
    *   A Gini Impurity of 0 means the node is perfectly pure (all samples belong to the same class).
    *   A Gini Impurity of 0.5 (for binary classification) means the node is maximally impure (samples are equally divided between the two classes). With $c$ classes the maximum is $1 - 1/c$.
*   **Strengths:**
    *   Computationally less intensive as it avoids logarithmic calculations.
    *   Favors larger partitions, which can sometimes lead to slight bias towards the majority class in a split.
*   **Weaknesses:**
    *   Can be less sensitive to class probability changes compared to Entropy.
*   **Use Cases:** Widely used in algorithms like CART (Classification and Regression Trees).
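
To make the two formulas concrete, the sketch below evaluates both measures on a few binary class distributions. The function names and the chosen probabilities are illustrative assumptions only.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a class-probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    # The sum is <= 0, so abs() flips the sign and avoids printing -0.000.
    return float(abs(np.sum(p * np.log2(p))))

def gini(p):
    """Gini impurity of a class-probability vector."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

for p1 in (0.0, 0.1, 0.3, 0.5):
    dist = [p1, 1.0 - p1]
    print(f"p = {p1:.1f}  entropy = {entropy(dist):.3f}  gini = {gini(dist):.3f}")
```

Both measures are zero for a pure node and peak at the 50/50 split, where entropy reaches 1 and Gini reaches 0.5; they differ mainly in scale and in the shape of the curve in between.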



Question 3: What is Pre-Pruning in Decision Trees?

Pre-pruning, also known as early stopping, is a technique used in the construction of decision trees to prevent overfitting. Instead of growing a full decision tree and then pruning it back (post-pruning), pre-pruning involves stopping the tree growth prematurely if certain conditions are met.

The main idea behind pre-pruning is to restrict the complexity of the decision tree during its formation, thus reducing the risk of the tree learning noise in the training data and improving its generalization ability on unseen data.

### How Pre-Pruning Works:

During the tree building process, at each node, before a split is performed, a pre-pruning criterion is evaluated. If the criterion is met, the tree growth is halted at that node, and the node is made a leaf node. The class label for this leaf node is typically determined by the majority class of the samples reaching that node.

### Common Pre-Pruning Criteria:

Several criteria can be used to decide when to stop splitting a node:

1.  **Maximum Depth:** The tree stops growing once it reaches a predefined maximum depth. This limits the number of sequential splits that can occur.
2.  **Minimum Samples per Split:** A node will not be split if the number of samples in that node is below a certain threshold. This prevents creating splits on very small subsets of data, which might be noisy.
3.  **Minimum Samples per Leaf Node:** A split will only be considered if it results in child nodes that each contain at least a specified minimum number of samples. This ensures that leaf nodes are not too small.
4.  **Maximum Number of Leaf Nodes:** The tree growth stops when the total number of leaf nodes reaches a predefined limit.
5.  **Impurity Threshold:** A node will not be split if its impurity (e.g., Gini impurity or entropy) is below a certain threshold. If a node is already 'pure enough,' further splitting might not provide significant gain.
6.  **Information Gain/Impurity Decrease Threshold:** A split is only performed if the information gain (or decrease in impurity) achieved by that split is above a certain minimum value. If the gain is too small, the split is considered insignificant and is not made.
7.  **Cross-Validation Score:** The tree's performance on a validation set (or through cross-validation) can be monitored. If splitting a node leads to a decrease in validation performance, the split is rejected.
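
Several of these criteria map directly onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below is one possible configuration, assuming the Iris dataset and threshold values picked purely for illustration; it covers criteria 1 to 4 and 6, while criterion 7 is typically approximated by tuning these values with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: no pre-pruning, the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: the threshold values below are arbitrary examples.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # criterion 1: maximum depth
    min_samples_split=10,        # criterion 2: minimum samples to split a node
    min_samples_leaf=5,          # criterion 3: minimum samples in each leaf
    max_leaf_nodes=8,            # criterion 4: maximum number of leaves
    min_impurity_decrease=0.01,  # criterion 6: minimum impurity decrease
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.4f}")
```

Comparing depth, leaf count, and test accuracy side by side shows how much complexity the constraints remove and whether that costs any accuracy on this particular split.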

### Advantages of Pre-Pruning:

*   **Reduces Overfitting:** Directly addresses overfitting by limiting tree complexity.
*   **Faster Training:** Since the tree stops growing earlier, the training process is generally faster than building a full tree and then post-pruning it.
*   **Simpler Trees:** Tends to produce smaller and more interpretable trees.
*   **Computational Efficiency:** Avoids the computational cost of generating unnecessary branches.



Question 4: Train a Decision Tree Classifier using Gini Impurity and print feature importances.

This program will demonstrate how to train a Decision Tree Classifier using `criterion='gini'` and then retrieve and display the feature importances. We'll use the well-known Iris dataset for this example.

In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 1. Load a sample dataset (Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

print(f"Dataset features: {feature_names}")
print(f"Shape of data (X): {X.shape}")
print(f"Shape of target (y): {y.shape}\n")

# 2. Split data into training and testing sets (optional but good practice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize the Decision Tree Classifier with Gini Impurity
#    criterion='gini' is the default, but we explicitly set it for clarity.
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# 4. Train the classifier on the training data
clf.fit(X_train, y_train)

print("Decision Tree Classifier trained successfully using Gini Impurity.\n")

# 5. Print the feature importances
print("Feature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"  {name}: {importance:.4f}")

# You can also get predictions and evaluate the model if needed
# from sklearn.metrics import accuracy_score
# y_pred = clf.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print(f"\nModel Accuracy on Test Set: {accuracy:.4f}")

Dataset features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Shape of data (X): (150, 4)
Shape of target (y): (150,)

Decision Tree Classifier trained successfully using Gini Impurity.

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0191
  petal length (cm): 0.8933
  petal width (cm): 0.0876


Question 5: What is a Support Vector Machine (SVM)?

A **Support Vector Machine (SVM)** is a powerful and versatile machine learning algorithm capable of performing linear or non-linear classification, regression, and even outlier detection. It is primarily used for classification tasks.

### Core Idea:

The fundamental concept behind SVMs is to find the "best" hyperplane that separates data points belonging to different classes in a high-dimensional space. The "best" hyperplane is defined as the one with the largest margin between the two classes.

*   **Hyperplane:** In a 2D space, a hyperplane is a line. In a 3D space, it's a plane. In spaces with more than three dimensions, it's called a hyperplane.
*   **Margin:** The margin is the distance between the hyperplane and the closest data points from each class. These closest points are called **support vectors**.

### How SVM Works (for Classification):

1.  **Finding the Optimal Hyperplane:** The SVM algorithm aims to find a hyperplane that maximizes the margin. A larger margin generally means lower generalization error of the classifier.

2.  **Support Vectors:** The data points that lie closest to the decision boundary (hyperplane) and determine its position and orientation are called support vectors. These are the critical elements of the training set: removing one of them can change the hyperplane, whereas removing any other point leaves it unchanged.

3.  **Handling Non-linear Data (Kernel Trick):** SVMs are incredibly effective because they can handle non-linearly separable data through a technique called the **kernel trick**. When data points cannot be separated by a straight line (or plane) in their original dimension, the kernel trick implicitly maps the input data into a higher-dimensional feature space where it might become linearly separable. Common kernel functions include:
    *   **Linear Kernel:** Used for linearly separable data.
    *   **Polynomial Kernel:** Projects data into a higher dimension using polynomial functions.
    *   **Radial Basis Function (RBF) or Gaussian Kernel:** A popular choice for complex, non-linear relationships.

4.  **Soft Margin Classification:** In real-world scenarios, data is often noisy, and perfect linear separation might not be possible, or it might lead to overfitting. SVMs can handle this using **soft margin classification**. This allows some instances to be on the wrong side of the margin, or even the wrong side of the hyperplane, by introducing a hyperparameter `C`. A smaller `C` allows more margin violations (more regularization), while a larger `C` tries to keep all instances off the margin or on the correct side (less regularization).
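
The effect of `C` can be observed directly on a small synthetic problem. The sketch below assumes two overlapping clusters generated with `make_blobs` and two arbitrary values of `C`; for a linear kernel the margin width is $2 / \|w\|$, which can be read off the fitted `coef_`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, chosen only for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # Distance between the two margin boundaries of a linear SVM.
    margin_width = 2.0 / np.linalg.norm(model.coef_)
    n_support = int(model.n_support_.sum())
    results[C] = (margin_width, n_support)
    print(f"C={C:>6}: margin width={margin_width:.3f}, support vectors={n_support}")
```

The small `C` should yield the wider margin, since the optimizer is allowed to trade margin violations for a smaller weight norm.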

### Key Features and Advantages:

*   **Effective in High-Dimensional Spaces:** Works well even when the number of features is greater than the number of samples.
*   **Memory Efficient:** Uses a subset of training points (support vectors) in the decision function, making it memory efficient.
*   **Versatile:** Can be used for both linear and non-linear classification and regression problems with different kernel functions.
*   **Robust to Outliers:** With soft margin classification, SVMs can be less sensitive to individual noisy data points.

### Disadvantages:

*   **Computationally Intensive:** Can be computationally expensive for very large datasets, especially with complex kernels.
*   **Sensitive to Feature Scaling:** SVMs are sensitive to the scaling of features. It's often recommended to normalize or standardize data before applying SVM.
*   **Kernel Choice:** Choosing the right kernel and its parameters can be challenging and often requires domain knowledge or extensive hyperparameter tuning.
*   **Lack of Probability Estimates:** Standard SVMs do not directly provide probability estimates. They output raw decision values, which can be converted to probabilities using techniques like Platt scaling, but this adds another layer of computation.

Question 6: What is the Kernel Trick in SVM?

The **Kernel Trick** is a fundamental concept that allows Support Vector Machines (SVMs) to effectively handle non-linearly separable data. In its essence, it's a method of using a kernel function to implicitly map input data into a higher-dimensional feature space, where it might become linearly separable, without actually calculating the coordinates of the data in that higher-dimensional space.

### Why is it needed?

Many real-world datasets are not linearly separable in their original feature space. This means you cannot draw a single straight line (or hyperplane in higher dimensions) to perfectly separate the different classes. For example, if you have data points forming concentric circles, a linear classifier would fail to distinguish between the inner and outer circles.
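
The concentric-circles case is easy to reproduce. The following sketch assumes scikit-learn's `make_circles` generator with arbitrarily chosen noise and radius settings, and compares a linear kernel with an RBF kernel on the same split.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other (synthetic data).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"Linear kernel accuracy: {linear_acc:.4f}")
print(f"RBF kernel accuracy:    {rbf_acc:.4f}")
```

No straight line can put the inner circle on one side and the whole outer circle on the other, so the linear model stays far from perfect while the RBF model separates the two rings.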

### How it Works:

Instead of explicitly transforming the data into a higher dimension (which can be computationally expensive and memory intensive, especially for very high dimensions), the kernel trick uses a **kernel function** `K(x, x')` that calculates the dot product of the transformed features, `φ(x) · φ(x')`, directly in the original feature space. That is:

$K(x, x') = \phi(x) \cdot \phi(x')$

Where:
*   `x` and `x'` are data points in the original feature space.
*   `φ` is the mapping function that transforms the data into a higher-dimensional space.
*   `K` is the kernel function.

This means that the SVM algorithm, which relies on dot products to calculate distances and angles between data points, can operate in the implicitly higher-dimensional space without ever needing to know the `φ` function or the explicit coordinates of the transformed data points.
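
A small numerical check makes the identity tangible. The sketch below assumes the homogeneous degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ in two dimensions, for which an explicit feature map happens to be known; the function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 in 2-D: maps R^2 to R^3."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)             # same value without ever building phi

print(f"phi(x) . phi(z) = {explicit:.6f}")
print(f"K(x, z)         = {implicit:.6f}")
```

Both routes give the same number, but the kernel route never constructs the higher-dimensional vectors. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.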

### Benefits of the Kernel Trick:

*   **Handles Non-Linear Separability:** The primary benefit is the ability to classify data that is not linearly separable in its original feature space.
*   **Computational Efficiency:** Avoids the explicit computation of coordinates in high-dimensional feature spaces, which can be computationally prohibitive. The kernel function itself is often much simpler and faster to compute.
*   **Memory Efficiency:** No need to store the high-dimensional transformed data points.
*   **Flexibility:** Different kernel functions can be chosen to suit various types of non-linear relationships in the data.

### Common Kernel Functions:

1.  **Linear Kernel:**
    $K(x, x') = x \cdot x'$

    (Equivalent to no transformation; used for linearly separable data.)

2.  **Polynomial Kernel:**
    $K(x, x') = (\gamma \, x \cdot x' + r)^d$

    (Maps data to a higher-dimensional space using polynomial functions, with `d` as the degree of the polynomial, `γ` as a scaling factor, and `r` as a constant term.)

3.  **Radial Basis Function (RBF) / Gaussian Kernel:**
    $K(x, x') = \exp(-\gamma \|x - x'\|^2)$

    (A very popular choice for complex, non-linear relationships. The kernel value decays with the distance between the two points, so the resulting decision boundary can bend around clusters of points. `γ` controls how far the influence of a single training example reaches: larger values mean a shorter reach.)

4.  **Sigmoid Kernel:**
    $K(x, x') = \tanh(\gamma \, x \cdot x' + r)$

    (Resembles the activation function used in neural networks; less common for SVMs than RBF or polynomial.)
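
The four formulas translate almost literally into NumPy. As a sanity check, the sketch below compares hand-written versions against the helpers in `sklearn.metrics.pairwise`; the random inputs and the values of `gamma`, `r`, and `d` are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points with 3 features
Y = rng.normal(size=(4, 3))   # 4 points with 3 features
gamma, r, d = 0.5, 1.0, 3     # arbitrary illustrative hyperparameters

dots = X @ Y.T                                                  # pairwise x . x'
sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # pairwise ||x - x'||^2

K_linear = dots
K_poly = (gamma * dots + r) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(gamma * dots + r)

print("linear :", np.allclose(K_linear, linear_kernel(X, Y)))
print("poly   :", np.allclose(K_poly, polynomial_kernel(X, Y, degree=d, gamma=gamma, coef0=r)))
print("rbf    :", np.allclose(K_rbf, rbf_kernel(X, Y, gamma=gamma)))
print("sigmoid:", np.allclose(K_sigmoid, sigmoid_kernel(X, Y, gamma=gamma, coef0=r)))
```

Each result is a 5×4 matrix of pairwise kernel values, which is the only information about the data that a kernel SVM needs during training.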

In summary, the kernel trick is a powerful mathematical tool that allows SVMs to find non-linear decision boundaries by implicitly operating in a higher-dimensional space, making them highly effective for a wide range of complex classification problems.

Question 7: Train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

This program demonstrates how to train two Support Vector Machine (SVM) classifiers: one using a `linear` kernel and another using a `Radial Basis Function (RBF)` kernel. We will use the Wine dataset, split it into training and testing sets, train both models, and then compare their classification accuracies on the test set.

In [2]:
from sklearn.svm import SVC
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

print(f"Dataset features: {feature_names}")
print(f"Shape of data (X): {X.shape}")
print(f"Shape of target (y): {y.shape}\n")

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Standardize the features (important for SVMs, especially with RBF kernel)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Initialize and train SVM with Linear Kernel
print("Training SVM with Linear Kernel...")
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_linear = svm_linear.predict(X_test_scaled)
accuracy_linear = accuracy_score(y_test, y_pred_linear)
print(f"Accuracy of SVM (Linear Kernel): {accuracy_linear:.4f}\n")

# 5. Initialize and train SVM with RBF Kernel
print("Training SVM with RBF Kernel...")
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
y_pred_rbf = svm_rbf.predict(X_test_scaled)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"Accuracy of SVM (RBF Kernel): {accuracy_rbf:.4f}\n")

# 6. Compare accuracies
print("--- Comparison ---")
if accuracy_linear > accuracy_rbf:
    print(f"The Linear Kernel SVM performed better with an accuracy of {accuracy_linear:.4f}.")
elif accuracy_rbf > accuracy_linear:
    print(f"The RBF Kernel SVM performed better with an accuracy of {accuracy_rbf:.4f}.")
else:
    print(f"Both Linear and RBF Kernel SVMs performed equally with an accuracy of {accuracy_linear:.4f}.")


Dataset features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Shape of data (X): (178, 13)
Shape of target (y): (178,)

Training SVM with Linear Kernel...
Accuracy of SVM (Linear Kernel): 0.9815

Training SVM with RBF Kernel...
Accuracy of SVM (RBF Kernel): 0.9815

--- Comparison ---
Both Linear and RBF Kernel SVMs performed equally with an accuracy of 0.9815.
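
A single 70/30 split gives both kernels the same score here, which says little about which one generalizes better. One way to get a steadier comparison is k-fold cross-validation. The sketch below assumes 5 folds and wraps the scaler and the classifier in a pipeline so that scaling is refit inside each fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

cv_scores = {}
for kernel in ("linear", "rbf"):
    # The pipeline keeps test-fold statistics out of the scaler's fit.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    cv_scores[kernel] = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>6}: mean accuracy = {cv_scores[kernel].mean():.4f} "
          f"(std = {cv_scores[kernel].std():.4f})")
```

The mean and standard deviation across folds give a rough sense of whether any difference between the kernels is larger than the split-to-split variation.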


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

The **Naïve Bayes classifier** is a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. It is a supervised learning algorithm used for classification tasks, particularly popular in text classification, spam filtering, and recommendation systems due to its simplicity and efficiency.

What is it?

At its core, the Naïve Bayes classifier works by calculating the probability of a data point belonging to a certain class, given the values of its features. It does this by leveraging **Bayes' Theorem**, which states:

$P(C|X) = \frac{P(X|C)P(C)}{P(X)}$

Where:
*   $P(C|X)$ is the posterior probability: the probability of class $C$ given feature vector $X$.
*   $P(X|C)$ is the likelihood: the probability of observing feature vector $X$ given class $C$.
*   $P(C)$ is the prior probability: the probability of class $C$ occurring independently.
*   $P(X)$ is the evidence: the probability of observing feature vector $X$ independently.

For a feature vector $X = (x_1, x_2, ..., x_n)$, Bayes' theorem becomes:

$P(C|x_1, ..., x_n) = \frac{P(x_1, ..., x_n|C)P(C)}{P(x_1, ..., x_n)}$

The classifier then predicts the class with the highest posterior probability:

$C_{predict} = \text{argmax}_{C} \ P(C|X)$

**Why is it called "Naïve"?**

The "Naïve" part of the name comes from the fundamental simplification the algorithm makes: **it assumes that all features are independent of each other given the class.**

Mathematically, this means:

$P(x_1, ..., x_n|C) = P(x_1|C)P(x_2|C) \cdots P(x_n|C) = \prod_{i=1}^{n} P(x_i|C)$

This assumption significantly simplifies the calculation of the likelihood term $P(X|C)$, making the model computationally efficient. Instead of having to calculate the joint probability of all features given the class, which would require an enormous amount of data and complex calculations, it can simply multiply the individual probabilities of each feature given the class.

**Implications of the "Naïve" Assumption:**

*   **Simplicity and Speed:** The independence assumption drastically reduces the computational complexity, making Naïve Bayes very fast to train and predict, even with large datasets and many features.
*   **Robustness to Irrelevant Features:** If a feature is irrelevant, it theoretically doesn't affect the other features' probabilities given the class, so it shouldn't negatively impact the classification much.
*   **"Zero-Frequency Problem":** If a category for a feature (in the test data) was not observed in the training data, the likelihood for that feature will be zero, causing the entire posterior probability to be zero. Techniques like Laplace smoothing are used to address this.
*   **Performance:** Despite its overly simplistic assumption, Naïve Bayes often performs surprisingly well in practice, especially for tasks where the independence assumption holds reasonably true or when the dataset is not extremely complex. For instance, in spam detection, the presence of certain words (features) often indicates spam regardless of other words, making the independence assumption somewhat acceptable.
*   **Interpretability:** The model is relatively easy to understand, as the probabilities provide insights into the importance of different features for each class.
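
The zero-frequency problem and its fix can be reproduced with a few lines of plain Python. The toy corpus, the function names, and the two-word test message below are all hypothetical; the point is only to show how one unseen word wipes out the whole product unless smoothing is applied.

```python
import math
from collections import Counter

# Hypothetical toy training set: (word, class) pairs.
training = [
    ("offer", "spam"), ("offer", "spam"), ("free", "spam"),
    ("meeting", "ham"), ("report", "ham"), ("free", "ham"),
]
vocabulary = sorted({word for word, _ in training})

def likelihood(word, label, alpha):
    """P(word | label) with additive (Laplace) smoothing parameter alpha."""
    words_in_class = [w for w, c in training if c == label]
    count = Counter(words_in_class)[word]
    return (count + alpha) / (len(words_in_class) + alpha * len(vocabulary))

def unnormalized_posterior(words, label, alpha):
    """P(label) times the product of per-word likelihoods (naive independence)."""
    prior = sum(1 for _, c in training if c == label) / len(training)
    return prior * math.prod(likelihood(w, label, alpha) for w in words)

message = ["offer", "meeting"]
for alpha in (0.0, 1.0):
    spam_score = unnormalized_posterior(message, "spam", alpha)
    ham_score = unnormalized_posterior(message, "ham", alpha)
    print(f"alpha={alpha}: spam score={spam_score:.4f}, ham score={ham_score:.4f}")
```

Without smoothing both scores collapse to zero, because "meeting" never occurs in spam and "offer" never occurs in ham, so the classifier cannot rank the classes. With `alpha=1` every word keeps a small non-zero probability and the comparison becomes meaningful again.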



Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

While all Naïve Bayes classifiers are based on the core principle of Bayes' Theorem with the assumption of feature independence, they differ primarily in the **assumptions they make about the distribution of the features** given the class. This leads to different ways of calculating the likelihood $P(x_i|C)$.

1. Gaussian Naïve Bayes

*   **Feature Type:** Assumes that continuous features associated with each class are distributed according to a **Gaussian (normal) distribution**.
*   **Likelihood Calculation:** It calculates the probability of a feature value by evaluating its probability density function (PDF) based on the mean and standard deviation of that feature for each class.
    *   $P(x_i | C) = \frac{1}{\sqrt{2\pi\sigma_{C,i}^2}} e^{ -\frac{(x_i - \mu_{C,i})^2}{2\sigma_{C,i}^2} }$
    where $\mu_{C,i}$ is the mean of feature $i$ given class $C$, and $\sigma_{C,i}^2$ is the variance of feature $i$ given class $C$.
*   **Use Cases:** Best suited for datasets where continuous numerical features are expected to follow a normal distribution. For example, classifying a customer as 'high-value' or 'low-value' based on their annual income, where income might be normally distributed within each group.
*   **Strengths:** Simple, fast, and works well for continuous data that is Gaussian or can be approximated as such.
*   **Weaknesses:** Assumes Gaussian distribution, which might not hold for all continuous data. Can be sensitive to outliers if not handled.

2. Multinomial Naïve Bayes

*   **Feature Type:** Assumes that features represent **counts or frequencies** (e.g., word counts in a document). It works with discrete data and is particularly well-suited for text classification problems where features are term frequencies or presence counts.
*   **Likelihood Calculation:** It assumes a multinomial distribution for the features. The likelihood $P(x_i|C)$ is calculated based on the proportion of times feature $x_i$ appears in class $C$ documents relative to the total number of features in class $C$.
    *   $P(x_i | C) = \frac{N_{C,i} + \alpha}{N_C + \alpha n}$
    where $N_{C,i}$ is the count of feature $i$ in class $C$ examples, $N_C$ is the total count of all features for class $C$, $n$ is the number of features, and $\alpha$ is a smoothing parameter (Laplace smoothing).
*   **Use Cases:** Widely used in Natural Language Processing (NLP) tasks such as spam detection, document classification, sentiment analysis (where features are typically word counts or TF-IDF values).
*   **Strengths:** Excellent for text classification, handles sparse data well, and is very fast.
*   **Weaknesses:** Not suitable for features that are not counts or frequencies. It requires non-negative feature values, and when only the presence or absence of a feature matters, Bernoulli Naïve Bayes is usually the better fit.

3. Bernoulli Naïve Bayes

*   **Feature Type:** Assumes that features are **binary (Boolean) variables**, meaning they can only take two values (e.g., presence or absence of a word). It is similar to Multinomial Naïve Bayes but explicitly models the non-occurrence of features.
*   **Likelihood Calculation:** It assumes a Bernoulli distribution. The likelihood $P(x_i|C)$ for a feature $x_i$ is calculated as the probability of its presence (1) or absence (0) given the class $C$.
    *   $P(x_i=1 | C) = P_{C,i}$ (probability of feature $i$ appearing in class $C$)
    *   $P(x_i=0 | C) = 1 - P_{C,i}$ (probability of feature $i$ not appearing in class $C$)
*   **Use Cases:** Also used in text classification, especially when dealing with very short texts or when only the presence/absence of words matters, rather than their frequency. For example, spam classification where a word either exists or not.
*   **Strengths:** Good for binary feature data, simple to implement, and handles non-occurrence of features explicitly.
*   **Weaknesses:** Only works for binary features. May lose information if feature frequencies are important.
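
The three variants are available in scikit-learn as `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below pairs each one with the kind of data it expects; the four-document corpus and its labels are invented for illustration and far too small to say anything about real performance.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous measurements -> Gaussian Naive Bayes
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)
gnb_train_acc = gnb.score(X_iris, y_iris)
print(f"GaussianNB training accuracy on Iris: {gnb_train_acc:.4f}")

# Hypothetical mini-corpus: 1 = spam, 0 = ham
docs = [
    "free prize free offer",
    "win a free prize now",
    "meeting agenda attached",
    "project report and meeting notes",
]
labels = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix of word counts

# Word counts -> Multinomial Naive Bayes
mnb = MultinomialNB(alpha=1.0).fit(counts, labels)
# Presence/absence -> Bernoulli Naive Bayes (binarize=0.0 turns counts > 0 into 1)
bnb = BernoulliNB(alpha=1.0, binarize=0.0).fit(counts, labels)

new_doc = vectorizer.transform(["free offer now"])
mnb_pred = int(mnb.predict(new_doc)[0])
bnb_pred = int(bnb.predict(new_doc)[0])
print(f"MultinomialNB prediction: {mnb_pred}")
print(f"BernoulliNB prediction:   {bnb_pred}")
```

The same count matrix feeds both text models; the difference is that `MultinomialNB` uses how often each word occurs, while `BernoulliNB` only uses whether it occurs and also accounts for the words that are absent.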



Question 10: Breast Cancer Dataset

Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

This program will demonstrate how to train a Gaussian Naïve Bayes classifier. We will use the Breast Cancer dataset from `sklearn.datasets`, split it into training and testing sets, train the model, and then evaluate its accuracy on the test set.

In [3]:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
feature_names = breast_cancer.feature_names

print(f"Dataset features: {list(feature_names)}")
print(f"Shape of data (X): {X.shape}")
print(f"Shape of target (y): {y.shape}\n")

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# 4. Train the classifier on the training data
print("Training Gaussian Naïve Bayes classifier...")
gnb.fit(X_train, y_train)

# 5. Make predictions on the test data
y_pred = gnb.predict(X_test)

# 6. Evaluate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nGaussian Naïve Bayes Classifier Accuracy: {accuracy:.4f}")

Dataset features: [np.str_('mean radius'), np.str_('mean texture'), np.str_('mean perimeter'), np.str_('mean area'), np.str_('mean smoothness'), np.str_('mean compactness'), np.str_('mean concavity'), np.str_('mean concave points'), np.str_('mean symmetry'), np.str_('mean fractal dimension'), np.str_('radius error'), np.str_('texture error'), np.str_('perimeter error'), np.str_('area error'), np.str_('smoothness error'), np.str_('compactness error'), np.str_('concavity error'), np.str_('concave points error'), np.str_('symmetry error'), np.str_('fractal dimension error'), np.str_('worst radius'), np.str_('worst texture'), np.str_('worst perimeter'), np.str_('worst area'), np.str_('worst smoothness'), np.str_('worst compactness'), np.str_('worst concavity'), np.str_('worst concave points'), np.str_('worst symmetry'), np.str_('worst fractal dimension')]
Shape of data (X): (569, 30)
Shape of target (y): (569,)

Training Gaussian Naïve Bayes classifier...

Gaussian Naïve Bayes Classifier A