# **Question 1 : What is Information Gain, and how is it used in Decision Trees?**
Information Gain measures the reduction in uncertainty (entropy) about the target variable after splitting the dataset using a particular feature.

- Higher Information Gain ⇒ Better Feature for Splitting

- Goal: Choose the feature that results in pure child nodes (i.e., nodes dominated by a single class)

HOW IT IS USED IN DECISION TREES

| Step | Role of Information Gain                                  |
| ---- | --------------------------------------------------------- |
| 1️⃣  | Check all features and calculate IG for each              |
| 2️⃣  | Choose the feature with **highest IG** to make a split    |
| 3️⃣  | Repeat recursively for sub-nodes until stopping condition |
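
The calculation behind the steps above can be sketched in a few lines. Information Gain is the entropy of the parent node minus the size-weighted average entropy of the child nodes. The toy labels and the two candidate splits below are hypothetical and serve only to illustrate the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy data: 4 "yes" and 4 "no" labels at the parent node.
parent = ["yes"] * 4 + ["no"] * 4

# Split A separates the classes perfectly; split B leaves both children mixed.
split_a = [["yes"] * 4, ["no"] * 4]
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

ig_a = information_gain(parent, split_a)
ig_b = information_gain(parent, split_b)
print(f"IG(split A) = {ig_a:.3f}")
print(f"IG(split B) = {ig_b:.3f}")
```

Split A yields pure children and therefore the larger gain, so a tree built with this criterion would pick it over split B.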


# **Question 2: What is the difference between Gini Impurity and Entropy?**
***(Hint: Directly compares the two main impurity measures, highlighting strengths, weaknesses, and appropriate use cases.)***

Difference Between Gini Impurity and Entropy

| Feature                   | **Gini Impurity**                                                        | **Entropy**                                                   |
| ------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------- |
| **Definition**            | Measures likelihood of incorrectly classifying a randomly chosen element | Measures the amount of disorder or uncertainty in the dataset |
| **Formula**               | $\text{Gini} = 1 - \sum_i p_i^2$                                         | $\text{Entropy} = -\sum_i p_i \log_2 p_i$                     |
| **Interpretation**        | Focuses on misclassification probability                                 | Focuses on information content (bit-based measure)            |
| **Range**                 | 0 (pure) to 0.5 (maximum for two classes)                                | 0 (pure) to 1 (maximum for two classes)                       |
| **Computation Speed**     | **Faster** — no logarithm involved                                       | **Slower** — uses logarithms                                  |
| **Bias During Splitting** | Tends to isolate the most frequent class in its own branch               | Tends to produce slightly more balanced trees (in practice the two usually yield very similar trees) |
| **Commonly Used In**      | CART (Classification & Regression Trees)                                 | ID3, C4.5, C5.0 decision trees                                |
| **Best Use Case**         | A reasonable default when training speed matters (it is scikit-learn's default criterion) | When an information-theoretic reading of impurity (in bits) is preferred |
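
To make the formulas and ranges in the table concrete, here is a minimal sketch that evaluates both measures for a two-class node. The helper names `gini` and `entropy` are illustrative, not library functions.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Two-class node where p is the share of the first class.
for p in (0.1, 0.25, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.2f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")

# Both measures peak at a 50/50 mix and vanish for a pure node.
max_gini = gini((0.5, 0.5))
max_entropy = entropy((0.5, 0.5))
pure_gini = gini((1.0, 0.0))
```

Both curves have the same shape (zero for a pure node, largest at an even mix), which is one reason the two criteria often select the same splits.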



# **Question 3: What is Pre-Pruning in Decision Trees?**

Pre-Pruning in Decision Trees

Pre-pruning (also called early stopping) is a technique used to stop the growth of a decision tree before it becomes too complex and starts overfitting the training data.

How It Works

During tree construction, the algorithm evaluates splits at each node.
If a split does not significantly improve model performance, the algorithm:

-  Stops further splitting
-  Converts that node into a leaf node
-  Prevents over-complex branches

Common Pre-Pruning Criteria

A tree stops splitting when any of the following conditions is met:

| Condition                                          | Meaning                                                |
| -------------------------------------------------- | ------------------------------------------------------ |
| **Minimum samples per node**                       | If too few samples, stop splitting                     |
| **Maximum tree depth**                             | Tree cannot grow beyond a set depth                    |
| **Minimum information gain or impurity reduction** | Split must sufficiently improve purity                 |
| **Validation set performance**                     | If performance doesn’t improve, prevent further splits |

PURPOSE

| Goal                     | Benefit                        |
| ------------------------ | ------------------------------ |
| Reduce overfitting       | Improves generalization        |
| Control model complexity | Faster training and prediction |
| Avoid useless branches   | Better interpretability        |
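
In scikit-learn, most of the pre-pruning criteria listed above correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one on Iris. The specific limits are illustrative values, not tuned ones, and the validation-based criterion is not covered because it is not a built-in option of this estimator.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any limit is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum tree depth
    min_samples_split=10,        # minimum samples needed to split a node
    min_samples_leaf=5,          # minimum samples required in each leaf
    min_impurity_decrease=0.01,  # minimum impurity reduction for a split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Comparing depth, leaf count, and test accuracy side by side shows how much complexity the limits remove and whether generalization suffers for it.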


# **Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).**

***(Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_. (Include your Python code and output in the code box below.))***

This demonstrates:

-  Decision Tree using Gini Impurity
-  Training on a dataset
-  Printing feature importances clearly

```python
# Question 4: Decision Tree Classifier using Gini Impurity

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Train Decision Tree with Gini criterion
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X, y)

# Get feature importances
importances = model.feature_importances_

# Display feature importances with labels
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(importance_df)
```

```text
Feature Importances:
             Feature  Importance
2  petal length (cm)    0.564056
3   petal width (cm)    0.422611
0  sepal length (cm)    0.013333
1   sepal width (cm)    0.000000
```


# **Question 5: What is a Support Vector Machine (SVM)?**

Support Vector Machine (SVM)

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for both classification and regression tasks, but most commonly for binary classification.

Core Idea:

SVM tries to find the best separating boundary (called a hyperplane) between classes.

That hyperplane is chosen such that:

- It maximizes the margin — the distance between the hyperplane and the nearest data points from each class
-  These critical nearest points are called Support Vectors (hence the name)

KEY CONCEPTS

| Term                | Meaning                                                      |
| ------------------- | ------------------------------------------------------------ |
| **Hyperplane**      | Decision boundary separating classes                         |
| **Margin**          | Distance between hyperplane and support vectors              |
| **Support Vectors** | Closest data points that influence the boundary              |
| **Kernel Trick**    | Computes similarities as if the data had been mapped to a higher-dimensional space, where the classes may become linearly separable |

Kernels in SVM

| Kernel         | Use Case                                     |
| -------------- | -------------------------------------------- |
| Linear         | Data is linearly separable                   |
| Polynomial     | Complex boundaries with polynomial patterns  |
| RBF (Gaussian) | Most common; handles nonlinear relationships |
| Sigmoid        | Neural network–like behavior                 |

WHY SVM?

| Strengths                             | Weaknesses                                 |
| ------------------------------------- | ------------------------------------------ |
| Works well with high-dimensional data | Slow for large datasets                    |
| Effective for clear margin separation | Sensitive to noise and overlapping classes |
| Flexible via kernels                  | Requires parameter tuning                  |
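
The terms hyperplane, margin, and support vectors can be inspected directly on a fitted model. This is a minimal sketch on hypothetical 2-D toy data. The large `C` value is an assumption chosen to approximate a hard-margin SVM on separable points.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D toy data: two linearly separable clusters.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C leaves almost no tolerance for margin violations.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", w, ", b =", b)
print(f"Margin width: {margin_width:.3f}")

prediction = clf.predict([[3.5, 3.5]])[0]
print("Prediction for [3.5, 3.5]:", prediction)
```

Only the points closest to the boundary appear in `support_vectors_`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged.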


# **Question 6: What is the Kernel Trick in SVM?**

Kernel Trick in SVM

The Kernel Trick is a mathematical technique used in Support Vector Machines that lets the model handle non-linearly separable data without explicitly transforming it into a higher-dimensional space.

What Problem Does It Solve?

Many datasets cannot be separated by a straight line in the original input space.

Example:
Classes shaped like concentric circles → linear SVM fails

So, SVM maps data into a higher-dimensional space where the classes become linearly separable.

But explicitly transforming data (e.g., from 2D → 3D → 100D) is:

- Very expensive
- Sometimes impossible to compute (the feature space of the RBF kernel, for example, is infinite-dimensional)

The Kernel Trick Solution

Instead of computing the actual transformation,
SVM uses a kernel function that calculates the dot product in the high-dimensional space directly from the original space.

Thus:

Complex classification boundaries are learned efficiently without heavy computation.
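
The equivalence described above can be checked numerically for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel, K(a, b) = (a · b)², whose explicit feature map for 2-D inputs is known in closed form. The function names `phi` and `poly_kernel` and the sample vectors are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2, computed in the original space."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # dot product taken in the 3-D feature space
implicit = poly_kernel(x, z)       # same quantity from the 2-D inputs only

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both routes give the same number, but the kernel never builds the 3-D vectors. For higher degrees or the RBF kernel, that saving is what makes training feasible.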

COMMON KERNEL FUNCTIONS

| Kernel             | When to Use                                      |
| ------------------ | ------------------------------------------------ |
| **Linear**         | Data is already linearly separable               |
| **Polynomial**     | Data has polynomial relationships                |
| **RBF (Gaussian)** | Most common; handles complex non-linear patterns |
| **Sigmoid**        | Similar to neural network activation             |


# **Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

***Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.***

**(Include your Python code and output in the code box below.)**



```python
# Question 7: SVM with Linear & RBF Kernels - Accuracy Comparison

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=42)

# SVM with Linear Kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
acc_linear = accuracy_score(y_test, y_pred_linear)

# SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

print("Accuracy Comparison:")
print(f"Linear Kernel SVM Accuracy: {acc_linear:.4f}")
print(f"RBF Kernel SVM Accuracy: {acc_rbf:.4f}")
```

```text
Accuracy Comparison:
Linear Kernel SVM Accuracy: 0.9815
RBF Kernel SVM Accuracy: 0.7593
```
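
The lower RBF score is most likely a feature-scaling effect, not a weakness of the kernel itself. The Wine features span very different numeric ranges, and the RBF kernel depends on Euclidean distances, so the largest-valued features dominate. As a sketch under that assumption, the comparison can be repeated with a `StandardScaler` in front of each classifier. The pipeline below is an addition for illustration and not part of the original answer.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, inside the pipeline.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel with scaling: {scores[kernel]:.4f}")
```

Putting the scaler inside the pipeline keeps test data out of the scaling statistics, so the comparison between kernels stays fair.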


# **Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?**

Naïve Bayes Classifier

Naïve Bayes is a probabilistic, supervised machine learning algorithm based on Bayes’ Theorem.
It is widely used for classification tasks, especially in:

- Spam detection

- Sentiment analysis

- Text classification

- Medical diagnosis

How Does It Work?

It calculates the probability of each class for a given feature set and assigns the class with the maximum posterior probability.

$$
P(\text{Class} \mid \text{Features}) = \frac{P(\text{Features} \mid \text{Class}) \cdot P(\text{Class})}{P(\text{Features})}
$$

KEY CHARACTERISTICS

| Feature         | Description                                 |
| --------------- | ------------------------------------------- |
| Assumption      | Features are independent (the "naïve" part) |
| Efficiency      | Very fast training and prediction           |
| Works well with | High-dimensional data (e.g., text)          |
| Output          | Class with highest probability              |

Advantages

- Simple and fast

- Works surprisingly well even when independence assumption is violated

- Performs great in real-world text-based tasks

Limitations

- If features are highly correlated, performance may drop

- Zero probability issue (handled using Laplace Smoothing)

WHY IS IT CALLED NAÏVE?

Because it makes a strong and unrealistic assumption:

***All features are conditionally independent of each other given the class label.***

Example:
In text classification, it assumes each word appears independently of the others given the class, which is rarely true (for example, "New" and "York" tend to appear together).
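
The whole procedure (prior, per-word likelihoods multiplied under the independence assumption, and Laplace smoothing) fits in a short from-scratch sketch. The tiny corpus and the helper names `log_posterior` and `predict` are hypothetical and exist only to show the mechanics.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label) pairs.
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "at", "noon"], "ham"),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_docs}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, label, alpha=1.0):
    """Unnormalized log P(label | tokens) under the naive independence assumption."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for w in tokens:
        # Laplace smoothing: alpha keeps unseen words from producing probability zero.
        p = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)  # independence turns the product into a sum of logs
    return score

def predict(tokens):
    """Return the class with the highest (unnormalized) posterior."""
    return max(class_docs, key=lambda label: log_posterior(tokens, label))

print(predict(["win", "money"]))
print(predict(["noon", "meeting", "prize"]))
```

The denominator P(Features) is omitted because it is the same for every class and does not change which class scores highest. Working in log space avoids numeric underflow when many small probabilities are multiplied.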

# **Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**

| Type of Naïve Bayes         | Assumption About Features                            | Suitable Data Type                | Common Use Cases                                    | Example Scenario                         |
| --------------------------- | ---------------------------------------------------- | --------------------------------- | --------------------------------------------------- | ---------------------------------------- |
| **Gaussian Naïve Bayes**    | Features follow a **normal (Gaussian) distribution** | Continuous / real-valued features | Classification with continuous measurements         | Height, weight, temperature, sensor data |
| **Multinomial Naïve Bayes** | Features are **counts** or **frequencies** of events | Discrete numeric data (≥ 0)       | Text classification, document term frequency models | Bag-of-Words, TF-IDF values              |
| **Bernoulli Naïve Bayes**   | Binary features: **0/1 presence or absence**         | Boolean indicators                | Spam detection with binary word occurrence          | Whether a specific word appears (yes/no) |


Key Points Summary

Gaussian NB

-  Best for continuous features
-  Assumes each feature follows a bell-curve (Gaussian) distribution within each class

Multinomial NB

-  Best for text with word frequency data
-  More sensitive to number of occurrences

Bernoulli NB
-  Best for binary features (present vs absent)
-  Evaluates whether a feature exists, not how many times

Final Understanding

***The core difference lies in the type of data each variant handles (continuous vs count vs binary), and the probability distribution assumed for the features.***
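
As a rough illustration of matching each variant to its data type, the sketch below fits all three scikit-learn classes on synthetic data. The data-generating choices (normal clusters, Poisson "word counts", and their presence/absence version) are assumptions made for the example, and the scores are training accuracies only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features with different class means: suited to GaussianNB.
X_cont = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])

# Count features with class-specific rates, standing in for word counts: MultinomialNB.
X_counts = np.vstack([rng.poisson([5, 1, 1], size=(50, 3)),
                      rng.poisson([1, 1, 5], size=(50, 3))])

# Binary features keep only presence or absence of each "word": BernoulliNB.
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

accuracies = {}
for name, (model, X) in models.items():
    accuracies[name] = model.fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {accuracies[name]:.2f}")
```

Because the binary version discards how often each feature occurs, BernoulliNB may score lower than MultinomialNB on this particular data, which mirrors the "exists vs. how many times" distinction above.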

# **Question 10: Breast Cancer Dataset**

**Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.**

***Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.***

**(Include your Python code and output in the code box below.)**



```python
# Question 10: Gaussian Naïve Bayes on Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into Train & Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Model Accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Model Accuracy on Breast Cancer Dataset:")
print(f"Gaussian Naïve Bayes Accuracy: {accuracy:.4f}")
```

```text
Model Accuracy on Breast Cancer Dataset:
Gaussian Naïve Bayes Accuracy: 0.9415
```
