# SVM & NAIVE BAYES

1. What is Information Gain, and how is it used in Decision Trees?

Ans 1.
Information Gain (IG) measures the reduction in entropy (uncertainty or impurity) in a dataset after splitting it on a specific feature. Decision Tree algorithms such as ID3 use it to choose the feature to split on at each node: the feature with the highest Information Gain yields the most homogeneous child nodes.
* Decision Tree Application:
* Root Node Selection: The algorithm calculates the Information Gain for every feature and selects the one that provides the highest gain.
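
To make the selection step concrete, here is a minimal sketch that computes Information Gain as the parent entropy minus the size-weighted entropy of the child nodes. The helper names `entropy` and `information_gain` and the toy data are hypothetical and chosen only for illustration; libraries such as scikit-learn do this internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted_child_entropy = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
play = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = [True, False, True, False]

ig_outlook = information_gain(play, outlook)
ig_windy = information_gain(play, windy)
print(f"IG(outlook) = {ig_outlook:.2f}, IG(windy) = {ig_windy:.2f}")
```

In this toy case a tree would split on `outlook` first, since it removes all uncertainty about the label while `windy` removes none.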

Ans 2. Gini Impurity and Entropy are both metrics used to measure node impurity in decision trees, guiding splits by reducing disorder. Gini impurity is the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution; for two classes it ranges from 0 to 0.5. Entropy measures the disorder of the class distribution; for two classes it ranges from 0 to 1 bit. Gini is generally slightly faster to compute because it avoids logarithms, whereas Entropy can produce slightly more balanced trees. In practice the two criteria often yield very similar trees.

* Use Gini Impurity when computational efficiency is critical or when dealing with imbalanced datasets.

* Use Entropy if the tree requires a more granular, theoretically justified measure of information gain.
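
As a small illustration of the two measures, the sketch below evaluates both on a few class distributions. For class proportions p_k, Gini is 1 - Σ p_k² and entropy is -Σ p_k log2 p_k. The function names and example distributions are hypothetical.

```python
import math

def gini_impurity(probabilities):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probabilities)

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

pure = [1.0, 0.0]              # every sample belongs to one class
balanced = [0.5, 0.5]          # two classes, evenly mixed
three_way = [1 / 3, 1 / 3, 1 / 3]  # three classes, evenly mixed

for name, dist in [("pure", pure), ("balanced", balanced), ("three_way", three_way)]:
    print(f"{name:10s} gini={gini_impurity(dist):.3f} entropy={entropy_bits(dist):.3f}")
```

Both measures are zero for a pure node and largest for an even mix. With more than two classes the maxima exceed 0.5 and 1, so those upper bounds apply to the binary case only.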


Ans 3.

Pre-pruning, or early stopping, is a decision tree optimization technique that halts tree growth during training before the tree becomes too complex, which helps prevent overfitting. It works by setting restrictions, such as a maximum depth or a minimum number of samples per split, that stop further branching. The result is a smaller, more interpretable model that trains faster.

* Key Aspects of Pre-Pruning:

* How it Works: Instead of allowing the tree to grow until every node is pure, pre-pruning checks constraints at each node before splitting. If a split fails to meet criteria (e.g., node samples are too few), it becomes a leaf node.

* Advantages: Significantly reduces training time, memory usage, and the risk of overfitting.

* Disadvantages: It can lead to underfitting if the stopping criteria are too strict, preventing the model from capturing necessary patterns.
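
The constraints described above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The following is a minimal sketch comparing an unconstrained tree with a pre-pruned one; the values `max_depth=2` and `min_samples_split=10` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# No constraints: the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: growth stops at depth 2, and nodes with fewer than
# 10 samples are not split any further (illustrative values).
pruned_tree = DecisionTreeClassifier(
    max_depth=2, min_samples_split=10, random_state=42
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name:10s} depth={tree.get_depth()} leaves={tree.get_n_leaves()} "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

Comparing the depth, leaf count, and test accuracy of the two trees shows the trade-off: tighter limits give a smaller tree, and overly strict limits risk underfitting.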

Ans 4. To train a Decision Tree Classifier using the Gini impurity criterion and print the feature importances, you can use the Python library scikit-learn.



In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def train_gini_decision_tree():
    """
    Trains a Decision Tree Classifier using Gini impurity and prints feature importances.
    """
    # 1. Load a sample dataset (Iris dataset is built into sklearn)
    print("Loading the Iris dataset...")
    iris = load_iris()
    X = iris.data
    y = iris.target
    feature_names = iris.feature_names
    print(f"Features: {feature_names}\n")

    # 2. Split the data into training and testing sets (optional but good practice)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # 3. Instantiate the Decision Tree Classifier with Gini criterion (default, but explicit)
    # criterion='gini' specifies the use of Gini impurity
    print("Initializing Decision Tree Classifier with criterion='gini'...")
    dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)

    # 4. Train the classifier
    print("Training the classifier...")
    dt_classifier.fit(X_train, y_train)
    print("Training complete.\n")

    # 5. Print the feature importances
    print("Feature Importances:")
    importances = dt_classifier.feature_importances_

    # Pair feature names with their importances for better readability
    feature_importances_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
    feature_importances_df = feature_importances_df.sort_values(by='Importance', ascending=False)

    # Format importance values to a specific number of decimal places
    feature_importances_df['Importance'] = feature_importances_df['Importance'].round(4)

    print(feature_importances_df.to_string(index=False))

    # Optional: Evaluate the model's accuracy
    accuracy = dt_classifier.score(X_test, y_test)
    print(f"\nModel Accuracy on test set: {accuracy:.4f}")

if __name__ == "__main__":
    train_gini_decision_tree()


Loading the Iris dataset...
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Initializing Decision Tree Classifier with criterion='gini'...
Training the classifier...
Training complete.

Feature Importances:
          Feature  Importance
petal length (cm)      0.8933
 petal width (cm)      0.0876
 sepal width (cm)      0.0191
sepal length (cm)      0.0000

Model Accuracy on test set: 1.0000


Ans 5.
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression. For classification, it finds the boundary (hyperplane) that separates the classes with the largest possible margin. Using kernel functions, it can also handle complex, high-dimensional data by implicitly mapping it into a higher-dimensional space. Typical applications include text classification (e.g., spam detection), image recognition, and bioinformatics.

* How SVM Works (Classification):

 * Find the Hyperplane: SVM identifies the best line (or plane/hyperplane in higher dimensions) that separates data points of different classes.

 * Maximize the Margin: It seeks the hyperplane with the widest possible gap (margin) to the nearest data points from each class, ensuring better generalization to new data.
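
The two steps above can be sketched on a tiny, linearly separable dataset. The points are hypothetical, and a large `C` is used here as an assumption to approximate a hard margin. For a linear SVM with weight vector `w`, the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 lies on x = 0, class 1 lies on x = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1000).fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
print("predictions:", clf.predict([[0.5, 0.5], [1.5, 0.5]]))
```

For this data the separating line sits midway between the two columns of points, and the margin width should come out close to 2, the gap between them.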

Ans 6. The Kernel Trick allows Support Vector Machines to classify non-linearly separable data efficiently. It implicitly maps the data into a higher-dimensional space, where it can become linearly separable, without ever computing the coordinates in that space, which saves a great deal of computation. Instead of evaluating dot products in the high-dimensional space, the SVM evaluates a kernel function (such as the Radial Basis Function or Polynomial kernel) on the original, lower-dimensional inputs. The resulting linear boundary in the feature space corresponds to a curved boundary in the input space.

* How it works:

* The Problem: Data that isn't linearly separable in its original form (e.g., a circle of points inside another) requires a curved boundary, which a linear classifier cannot produce.

* The Idea (Mapping): The trick maps data points (e.g., from 2D to 3D or higher) into a new feature space where they can be separated by a flat boundary (hyperplane).
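
A short numerical check can make the "implicit mapping" idea tangible. The sketch below uses the degree-2 polynomial kernel K(a, b) = (a · b)², whose explicit feature map for 2D inputs is φ(x) = (x₁², √2·x₁x₂, x₂²). The function names `phi` and `poly_kernel` and the sample vectors are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit map from 2D input space to the 3D feature space."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed entirely in the 2D input space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # dot product after mapping to 3D
implicit = poly_kernel(a, b)       # same value, no mapping needed

print("explicit:", explicit)
print("implicit:", implicit)
```

Both computations give the same number, but the kernel version never builds the 3D vectors. For kernels such as RBF, the corresponding feature space is infinite-dimensional, so the explicit route is not available at all.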

Ans 7.

In [None]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# 2. Split the data into training and testing sets
# Using a common split of 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 3. Train an SVM classifier with a Linear kernel
print("Training Linear SVM...")
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
print("Linear SVM training complete.")

# 4. Train an SVM classifier with an RBF kernel
print("\nTraining RBF SVM...")
# RBF kernel often benefits from scaling, but for this basic example we skip it
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
print("RBF SVM training complete.")

# 5. Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# 6. Compare accuracies
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print("\n--- Accuracy Comparison ---")
print(f"Linear Kernel SVM Accuracy: {accuracy_linear:.4f}")
print(f"RBF Kernel SVM Accuracy: {accuracy_rbf:.4f}")

# Optional: Determine which performed better
if accuracy_linear > accuracy_rbf:
    print("\nThe Linear Kernel SVM performed better on this dataset split.")
elif accuracy_rbf > accuracy_linear:
    print("\nThe RBF Kernel SVM performed better on this dataset split.")
else:
    print("\nBoth SVMs performed identically on this dataset split.")
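
The comparison above skips feature scaling for simplicity. As a hedged follow-up sketch, the RBF model can be wrapped in a pipeline with `StandardScaler` so that features measured on very different scales contribute comparably to the kernel distance. The same split parameters are reused; the variable names are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only and then
# applies the same transformation to the test data at prediction time.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)

scaled_accuracy = scaled_rbf.score(X_test, y_test)
print(f"RBF Kernel SVM Accuracy (with scaling): {scaled_accuracy:.4f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training, which matters if the model is later cross-validated or tuned.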



Ans 8.

The Naive Bayes classifier is a fast, simple probabilistic model that predicts class membership by applying Bayes' theorem under a strong assumption: all features are conditionally independent of each other, given the class. It is called "naive" because this independence assumption (e.g., that the words in an email don't affect each other) rarely holds in real data. Even so, the model often performs well in practice, especially for text classification tasks such as spam filtering.

* How it works:

* Based on Bayes' Theorem: It calculates the probability of a class given the observed features.

* The "Naive" Assumption: It simplifies the calculation by assuming each feature (like a word) contributes independently to the probability, ignoring relationships between features.
* Efficient & Simple: This independence assumption drastically reduces computational complexity, making the model fast and easy to implement, even for high-dimensional data.
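
The calculation can be worked through by hand on a toy spam example. All probabilities below are made-up numbers for illustration, and `posterior` is a hypothetical helper. Under the naive assumption, the score for a class is its prior multiplied by the per-word likelihoods, and the scores are then normalized to sum to 1.

```python
# Hypothetical priors P(class) and likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Return P(class | words), treating the words as independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # independence: multiply likelihoods
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

With only "free" observed, spam is the more probable class. Adding "meeting" shifts the posterior toward ham, because each word contributes its own independent factor.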

Ans 9.

Gaussian, Multinomial, and Bernoulli Naive Bayes differ primarily in the type of data they model: Gaussian assumes continuous, normally distributed features (e.g., height); Multinomial models discrete counts, which suits text classification with word counts; and Bernoulli handles binary (0/1) features, focusing on whether a feature is present or absent.

Gaussian Naive Bayes in more detail:

* Data Type: Continuous/Numerical (real-valued).
* Assumption: Within each class, each feature follows a normal (Gaussian) distribution.
* Characteristic: Estimates the per-class mean and standard deviation of each feature and uses them to compute likelihoods.
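
As a rough sketch of how the three variants line up with data types, the example below fits each scikit-learn class on a tiny hypothetical dataset of the matching kind. The arrays are invented for illustration; `theta_` is the fitted attribute in which `GaussianNB` stores the per-class feature means.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0], [2.0], [8.0], [9.0]])
gnb = GaussianNB().fit(X_continuous, y)
print("per-class means:", gnb.theta_.ravel())

# Word counts -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
mnb = MultinomialNB().fit(X_counts, y)

# Word presence/absence (0/1) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y)

print("GaussianNB:   ", gnb.predict([[1.5], [8.5]]))
print("MultinomialNB:", mnb.predict([[4, 0], [0, 4]]))
print("BernoulliNB:  ", bnb.predict([[1, 0], [0, 1]]))
```

The classes share the same `fit`/`predict` interface, so switching variants is mostly a matter of choosing the one whose distributional assumption matches the features.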



Ans 10.

This Python program uses scikit-learn to load the breast cancer dataset, split it into training and testing sets, train a GaussianNB classifier, and evaluate its accuracy. The model predicts whether a tumor is malignant or benign from 30 numeric features; on the split used below it reaches about 97% accuracy.



In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize the Gaussian Naive Bayes classifier
clf = GaussianNB()

# 4. Train the classifier
clf.fit(X_train, y_train)

# 5. Make predictions
y_pred = clf.predict(X_test)

# 6. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.9737

Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Confusion Matrix:
[[40  3]
 [ 0 71]]
