# Question 1: What is Information Gain, and how is it used in Decision Trees?
-Information Gain in Decision Trees

Information Gain is a metric used in decision trees to decide which feature should be used to split the data at each node. It measures how much uncertainty (impurity) in the target variable is reduced after splitting the dataset based on a particular feature.

Information Gain is calculated using Entropy, which quantifies the randomness or disorder in the dataset. A dataset with evenly mixed class labels has high entropy, while a pure dataset (all samples in one class) has zero entropy.

When building a decision tree, the algorithm:

1. Calculates the entropy of the parent dataset.
2. Splits the dataset using a feature.
3. Computes the weighted entropy of the resulting child subsets.
4. Calculates Information Gain as the difference between the parent entropy and the weighted child entropy.

The feature that provides the highest Information Gain is selected for the split because it best separates the data into distinct classes.
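The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements it: the helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy data: 10 labels split into two subsets by some feature.
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
left = np.array([1, 1, 1, 1, 0])
right = np.array([1, 0, 0, 0, 0])

gain = information_gain(parent, [left, right])
print(f"Parent entropy: {entropy(parent):.4f}")
print(f"Information gain: {gain:.4f}")
```

A tree builder would repeat this calculation for every candidate feature and keep the split with the largest gain.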

Why It Is Important

Information Gain helps decision trees:

- Select the most informative features
- Create purer child nodes
- Build efficient and accurate classification models

# Question 2: What is the difference between Gini Impurity and Entropy?
-Gini Impurity vs Entropy

Gini Impurity and Entropy are two common metrics used in decision trees to measure the impurity of a node, guiding the algorithm in selecting the best feature to split the data. Gini Impurity is the probability of misclassifying a randomly chosen instance if it were labeled at random according to the node's class distribution, computed as 1 − Σ pᵢ². It needs no logarithms, so it is slightly cheaper to compute, and it is commonly used in CART decision trees. Entropy, on the other hand, comes from information theory and measures the amount of uncertainty in the class distribution, computed as −Σ pᵢ log₂ pᵢ. It is used in algorithms like ID3 and C4.5.

Both metrics are 0 for a pure node and reach their maximum when all classes are equally likely; for two classes, that maximum is 0.5 for Gini and 1 for Entropy. In practice the two criteria usually lead to very similar trees, so the choice mostly depends on the algorithm being used and on computational cost: Gini is a common default because it is faster, and Entropy is the natural choice when splits should be interpreted in terms of information gain.
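To make the comparison concrete, the sketch below evaluates both measures for a two-class node at a few class proportions. The helper names `gini` and `entropy_bits` are assumptions made for this example.

```python
import numpy as np

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

def entropy_bits(probs):
    """Shannon entropy in bits, treating 0 * log(0) as 0."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

# p is the proportion of the positive class in a two-class node.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = [p, 1.0 - p]
    print(f"p={p:.1f}  gini={gini(dist):.4f}  entropy={entropy_bits(dist):.4f}")
```

Both values are zero for a pure node and peak at an even split, which is why the two criteria tend to rank candidate splits similarly.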

# Question 3: What is Pre-Pruning in Decision Trees?
-Pre-Pruning in Decision Trees

Pre-pruning (also called early stopping) is a technique used in decision tree algorithms to stop the growth of the tree early, before it perfectly classifies all training samples. The idea is to prevent the tree from becoming too complex and overfitting the training data.

During tree construction, the algorithm evaluates whether to split a node further based on predefined criteria, such as:

- Maximum depth of the tree
- Minimum number of samples required to split a node
- Minimum information gain or impurity decrease threshold

If the node does not meet the criteria, the algorithm stops splitting, and the node becomes a leaf node.
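In scikit-learn, these criteria correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one; the threshold values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: splitting stops as soon as any criterion is violated.
# The values below are assumptions chosen only for illustration.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_impurity_decrease=0.01,
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    acc = tree.score(X_test, y_test)
    print(f"{name}: depth={tree.get_depth()}, "
          f"leaves={tree.get_n_leaves()}, test accuracy={acc:.4f}")
```

Comparing depth, leaf count, and test accuracy side by side shows how much complexity the stopping criteria remove and whether generalization suffers.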

Advantages of Pre-Pruning

- Reduces overfitting, improving generalization to unseen data
- Reduces tree complexity and training time
- Makes the tree simpler and more interpretable

Disadvantages

If stopped too early, the tree might underfit, missing important patterns in the data

# Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).


In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

dt_model = DecisionTreeClassifier(criterion='gini', random_state=42)

dt_model.fit(X_train, y_train)

feature_importances = pd.Series(dt_model.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)

print("Feature Importances (Gini):")
print(feature_importances)

print("\nTop 5 Features:")
print(feature_importances.head(5))


Feature Importances (Gini):
mean concave points        0.705839
worst texture              0.114062
worst radius               0.070187
worst area                 0.036653
mean texture               0.022885
concave points error       0.017164
area error                 0.013563
worst smoothness           0.010492
concavity error            0.007152
smoothness error           0.002004
mean compactness           0.000000
mean area                  0.000000
mean perimeter             0.000000
mean radius                0.000000
perimeter error            0.000000
texture error              0.000000
radius error               0.000000
mean fractal dimension     0.000000
mean concavity             0.000000
mean symmetry              0.000000
mean smoothness            0.000000
compactness error          0.000000
symmetry error             0.000000
fractal dimension error    0.000000
worst perimeter            0.000000
worst compactness          0.000000
worst concavity            0.000000


# Question 5: What is a Support Vector Machine (SVM)?
-Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It is especially popular for binary classification problems. SVM works by finding the best decision boundary (hyperplane) that separates data points of different classes with the maximum margin.

Key Concepts

- Hyperplane: A line (in 2D), plane (in 3D), or higher-dimensional surface that separates the classes.
- Support Vectors: The data points closest to the hyperplane. These points are critical in defining the position and orientation of the hyperplane.
- Margin: The distance between the hyperplane and the nearest support vectors. SVM aims to maximize this margin, which improves generalization to new data.

How SVM Works

- For linearly separable data: SVM finds the hyperplane that separates the two classes with the largest margin.
- For non-linear data: SVM uses kernel functions (like RBF, polynomial, or sigmoid) to implicitly map the data into a higher-dimensional space where a linear separation may become possible.
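For the linearly separable case, the hyperplane, support vectors, and margin can be read directly off a fitted scikit-learn `SVC`. This is a minimal sketch on hypothetical toy data; the large `C` value is an assumption used to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data in 2D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w . x + b = 0
b = clf.intercept_[0]
margin = 1.0 / np.linalg.norm(w)  # distance from hyperplane to nearest points

print("Support vectors:\n", clf.support_vectors_)
print(f"Hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print(f"Margin (hyperplane to nearest point): {margin:.3f}")
```

Only the support vectors determine `w` and `b`; moving any other point without crossing the margin leaves the boundary unchanged.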

Advantages

- Effective in high-dimensional spaces
- Works well when the number of features is greater than the number of samples
- Relatively robust to overfitting, because maximizing the margin acts as a form of regularization

Disadvantages

- Can be computationally expensive for large datasets
- Choosing the right kernel and hyperparameters can be tricky

# Question 6: What is the Kernel Trick in SVM?
-Kernel Trick in SVM

The Kernel Trick is a technique used in Support Vector Machines (SVM) to handle non-linearly separable data. Instead of trying to separate the data in the original feature space, the kernel trick lets the SVM work as if the data had been mapped into a higher-dimensional space where a linear hyperplane can separate the classes.

How It Works

1. Suppose the data cannot be separated by a straight line in 2D.
2. Using a kernel function, SVM implicitly transforms the data into a higher-dimensional space (e.g., 3D).
3. In this new space, a linear hyperplane can separate the classes.
4. The kernel allows SVM to compute inner products in the high-dimensional space without explicitly transforming the data, which saves computation.
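The claim that a kernel computes a high-dimensional inner product without the explicit mapping can be checked numerically. The sketch below uses a degree-2 polynomial kernel as the example; the feature map `phi` and the helper `poly_kernel` are names assumed for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: 2-D input to 3-D feature space."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return float(np.dot(a, b) ** 2)

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # inner product after mapping to 3-D
implicit = poly_kernel(a, b)              # same quantity, computed in 2-D

print(f"Explicit mapping: {explicit:.4f}")
print(f"Kernel function:  {implicit:.4f}")
```

Both lines print the same value, yet the kernel never builds the 3-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the shortcut is the only practical option.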

Common Kernel Functions
| Kernel | Description |
| --- | --- |
| Linear | No transformation; used when data is already linearly separable. |
| Polynomial | Maps data into polynomial feature space. |
| RBF (Radial Basis Function / Gaussian) | Maps data into infinite-dimensional space; very flexible. |
| Sigmoid | Similar to neural network activation; less commonly used. |

In [2]:
#Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

data = load_wine()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_linear = svm_linear.predict(X_test_scaled)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
y_pred_rbf = svm_rbf.predict(X_test_scaled)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"SVM Linear Kernel Accuracy: {accuracy_linear:.4f}")
print(f"SVM RBF Kernel Accuracy: {accuracy_rbf:.4f}")

SVM Linear Kernel Accuracy: 0.9630
SVM RBF Kernel Accuracy: 0.9815


# Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?
-Naïve Bayes Classifier

The Naïve Bayes classifier is a probabilistic supervised learning algorithm based on Bayes’ Theorem, used for classification tasks. It predicts the probability that a given input belongs to a particular class based on the conditional probabilities of the features.

Bayes’ Theorem

Bayes’ Theorem forms the basis of this classifier:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$


Where:

- P(C∣X) → Probability of class C given features X (posterior)
- P(X∣C) → Probability of features given the class (likelihood)
- P(C) → Prior probability of the class
- P(X) → Probability of the features (evidence)

Why is it called "Naïve"?

It is called “naïve” because the algorithm assumes that all features are independent of each other, given the class label. In real-world datasets, this assumption is often not true, but despite this simplification, Naïve Bayes works surprisingly well in practice.
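Written out, the independence assumption lets the likelihood factor into one term per feature. The indexed notation $X = (x_1, \dots, x_n)$ is introduced here only for illustration:

```latex
P(X \mid C) = \prod_{i=1}^{n} P(x_i \mid C)
\quad\Longrightarrow\quad
P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The evidence $P(X)$ is the same for every class, so it can be dropped when only the most probable class is needed.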

Advantages

- Simple and fast
- Performs well on high-dimensional data
- Works well for text classification (spam detection, sentiment analysis)

Disadvantages

- The independence assumption may not hold in practice, which can affect accuracy
- Not ideal for datasets with correlated features

# Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

-Naïve Bayes classifiers have different variants depending on the type of data and the probability distribution assumed for the features.

Gaussian Naïve Bayes is used when the features are continuous numerical values. It assumes that each feature follows a Gaussian (normal) distribution. This makes it suitable for datasets with measurements such as height, weight, or lab values, like the Iris or Breast Cancer datasets.

Multinomial Naïve Bayes is designed for count or frequency data. It assumes that the features represent the number of times an event occurs, following a multinomial distribution. This is commonly used in text classification, where features are word counts in documents, such as spam detection or document categorization.

Bernoulli Naïve Bayes is used for binary data, where features indicate the presence or absence of a characteristic. It assumes each feature follows a Bernoulli distribution and is ideal for tasks such as checking whether a word appears in an email for spam detection.
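The pairing of variant and data type can be sketched with scikit-learn's three classes. The tiny arrays below are hypothetical and exist only to show which kind of input each variant expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 3.1], [0.9, 2.8], [4.5, 7.2], [5.1, 6.9]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}
for name, (model, X) in models.items():
    model.fit(X, y)
    print(name, "training predictions:", model.predict(X))
```

All three share the same `fit`/`predict` interface; what changes is the per-feature distribution each one assumes.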

In [3]:
#Question 10: Breast Cancer Dataset
#Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
#dataset and evaluate accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

gnb = GaussianNB()

gnb.fit(X_train_scaled, y_train)

y_pred = gnb.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print("Gaussian Naïve Bayes Accuracy:", accuracy)

print("\nClassification Report:\n", classification_report(y_test, y_pred))

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Gaussian Naïve Bayes Accuracy: 0.935672514619883

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.89      0.91        64
           1       0.94      0.96      0.95       107

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171

Confusion Matrix:
 [[ 57   7]
 [  4 103]]
