### **Question 1:** What is Information Gain, and how is it used in Decision Trees?

**Answer:**  
**Information Gain (IG)** measures how much uncertainty (entropy) in the target variable is reduced after splitting the data based on a particular feature.  
It helps Decision Trees decide **which attribute to split on at each node**.

Mathematically,  
\[
IG(T, X) = Entropy(T) - \sum_{v \in Values(X)} \frac{|T_v|}{|T|} \times Entropy(T_v)
\]

Where:  
- **T:** The entire dataset  
- **X:** The attribute being tested  
- **\(T_v\):** The subset of data where X has value v  

A feature with higher Information Gain produces purer subsets after the split, meaning it removes more uncertainty about the target.  
Hence, the algorithm picks the attribute with the **highest Information Gain** for splitting at each step.
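The formula above can be checked by hand on a small example. The sketch below is a minimal illustration, not how any particular library implements it; the toy data (`play`, `outlook`, `windy`) and the helper names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy(T) minus the size-weighted entropy of each subset T_v."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 8 samples, one target and two candidate attributes.
play = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
outlook = ["sunny", "sunny", "sunny", "sunny", "rain", "rain", "rain", "rain"]
windy = ["t", "f", "t", "f", "t", "f", "t", "f"]

ig_outlook = information_gain(outlook, play)  # every subset is pure
ig_windy = information_gain(windy, play)      # every subset is still 50/50
print("IG(outlook):", ig_outlook)
print("IG(windy):", ig_windy)
```

Here `outlook` separates the classes perfectly while `windy` leaves each subset as mixed as the parent, so a tree built on this data would split on `outlook` first.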


### **Question 2:** What is the difference between Gini Impurity and Entropy?

**Answer:**  
Both **Gini Impurity** and **Entropy** are measures of impurity used in Decision Trees to evaluate the quality of splits.

| Criterion | Formula | Range | Best (Pure) Value | Interpretation |
|------------|----------|--------|------------------|----------------|
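| **Entropy** | \(-\sum p_i \log_2(p_i)\) | 0 to 1 (for binary); 0 to \(\log_2 k\) for \(k\) classes | 0 | Measures disorder or randomness in data. |
| **Gini Impurity** | \(1 - \sum p_i^2\) | 0 to 0.5 (for binary); 0 to \(1 - 1/k\) for \(k\) classes | 0 | Measures the probability of misclassifying a sample that is labeled at random according to the node's class distribution. |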

**Comparison:**  
- **Entropy** uses logarithms and gives more weight to rare classes.  
- **Gini** is computationally faster since it avoids log calculations.  
- Both usually produce similar trees. **Gini** is often chosen for speed (it is the default in scikit-learn), while **Entropy** has a direct information-theoretic reading as bits of uncertainty.
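To make the two formulas concrete, the following sketch evaluates both measures on a few class distributions. The helper names and the example distributions are illustrative assumptions, not taken from any library.

```python
from math import log2


def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


def gini(probs):
    """Gini impurity: one minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)


# Pure node, skewed binary node, balanced binary node, balanced 3-class node.
distributions = ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3])
for probs in distributions:
    print(probs, "entropy =", round(entropy(probs), 3), "gini =", round(gini(probs), 3))
```

Both measures are zero for the pure node and largest for the balanced ones. For the balanced three-class node the entropy is \(\log_2 3 \approx 1.585\), which shows that the 0 to 1 range only applies to the binary case.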


### **Question 3:** What is Pre-Pruning in Decision Trees?

**Answer:**  
**Pre-pruning** (also known as **early stopping**) is a technique used to **stop the tree from growing too deep**.  
Instead of allowing the tree to fully grow and then trimming it (post-pruning), pre-pruning applies restrictions during training.

Common pre-pruning conditions include:  
- Maximum depth of the tree (`max_depth`)  
- Minimum number of samples to split (`min_samples_split`)  
- Minimum leaf node size (`min_samples_leaf`)  
- Minimum information gain threshold (the closest scikit-learn parameter is `min_impurity_decrease`)  

This method helps prevent **overfitting**, reduces computational cost, and improves model generalization.
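As a rough illustration of these conditions, the sketch below trains one unrestricted tree and one pre-pruned tree on the Iris dataset with scikit-learn. The specific threshold values are arbitrary assumptions chosen for the example, and `min_impurity_decrease` is used as a stand-in for a minimum information gain threshold.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: grows until every leaf is pure.
full = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit is hit.
# The limit values below are illustrative, not recommendations.
pruned = DecisionTreeClassifier(
    max_depth=3,                 # cap on tree depth
    min_samples_split=10,        # a node needs at least 10 samples to split
    min_samples_leaf=5,          # every leaf keeps at least 5 samples
    min_impurity_decrease=0.01,  # a split must reduce impurity by this much
    random_state=42,
).fit(X, y)

print("Unrestricted: depth", full.get_depth(), "leaves", full.get_n_leaves())
print("Pre-pruned:   depth", pruned.get_depth(), "leaves", pruned.get_n_leaves())
```

In practice these limits are usually tuned with cross-validation rather than set by hand.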


### **Question 4:** Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances.


**Answer:**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Load the Iris data into a DataFrame so the feature names are kept
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Train a decision tree that uses Gini impurity to choose its splits
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Print the importance of each feature
for feature, importance in zip(X.columns, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

**Output:**

```
sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226
```


### **Question 5:** What is a Support Vector Machine (SVM)?

**Answer:**  
A **Support Vector Machine (SVM)** is a supervised learning algorithm used for both **classification and regression** tasks.  
It works by finding the **best hyperplane** that separates data points of different classes with the **maximum margin**.  

The data points closest to this boundary are called **support vectors**. They alone determine where the decision boundary lies.

**Key Advantages:**  
- Works well in high-dimensional spaces.  
- Effective even when the number of features exceeds the number of samples.  
- Uses different kernel functions to handle non-linear data.
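The role of support vectors can be seen on a tiny, hand-made dataset. The sketch below is an illustration under stated assumptions: the six 2D points are hypothetical, and the large `C` value is chosen only to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: three points per class.
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

print("Support vectors:")
print(clf.support_vectors_)
print("Hyperplane weights:", clf.coef_, "intercept:", clf.intercept_)
```

Only the points nearest the boundary appear in `support_vectors_`; the remaining points could be removed without changing the fitted hyperplane.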


### **Question 6:** What is the Kernel Trick in SVM?

**Answer:**  
The **Kernel Trick** allows SVMs to handle **non-linear relationships** without explicitly transforming data into higher dimensions.  

A **kernel function** computes similarity between data points in an implicit high-dimensional feature space.  
Common kernel types include:  
- **Linear Kernel:** Used for linearly separable data.  
- **Polynomial Kernel:** Maps data into polynomial feature space.  
- **RBF (Radial Basis Function) Kernel:** Handles complex, non-linear boundaries.  

In short, the kernel trick enables SVMs to **separate complex datasets efficiently** without ever computing the higher-dimensional transformation directly.
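A small numeric check makes the idea concrete. The sketch below assumes a degree-2 polynomial kernel \(K(x, z) = (x \cdot z)^2\) on 2D inputs, whose explicit feature map is \(\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\); the function names and sample points are hypothetical.

```python
from math import isclose, sqrt


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map for a 2D point."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2D space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # map to 3D first, then take the dot product
implicit = poly_kernel(x, z)    # never leaves 2D
print(explicit, implicit, isclose(explicit, implicit))
```

Both routes give the same similarity value, but the kernel version never constructs the higher-dimensional vectors. For kernels such as RBF, the implicit space is infinite-dimensional, so the kernel route is the only practical one.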


### **Question 7:** Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.


**Answer:**

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine data and hold out 30% of it for testing
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42)

# SVM with a linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
linear_acc = accuracy_score(y_test, svm_linear.predict(X_test))

# SVM with an RBF kernel (default hyperparameters)
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
rbf_acc = accuracy_score(y_test, svm_rbf.predict(X_test))

# Compare test accuracies
print("Linear Kernel Accuracy:", round(linear_acc, 3))
print("RBF Kernel Accuracy:", round(rbf_acc, 3))
```

**Output:**

```
Linear Kernel Accuracy: 0.981
RBF Kernel Accuracy: 0.759
```
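One plausible reason for the gap between the two accuracies is that the Wine features are on very different scales, and the RBF kernel depends on distances between samples. The sketch below tests that assumption by standardizing the features inside a pipeline; it reuses the same split settings as above, and the variable names are new to this example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# RBF SVM on the raw features
raw_rbf = SVC(kernel="rbf").fit(X_train, y_train)
raw_acc = raw_rbf.score(X_test, y_test)

# RBF SVM with standardization; the scaler is fitted on the training data only
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)
scaled_acc = scaled_rbf.score(X_test, y_test)

print("RBF without scaling:", round(raw_acc, 3))
print("RBF with scaling:   ", round(scaled_acc, 3))
```

Putting the scaler in a pipeline keeps the test set out of the scaling statistics, so the comparison stays fair.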


### **Question 8:** What is the Naïve Bayes classifier, and why is it called "Naïve"?

**Answer:**  
The **Naïve Bayes classifier** is a probabilistic model based on **Bayes’ Theorem**, which predicts class membership probabilities based on prior knowledge and observed data.

It is called “Naïve” because it **assumes all features are conditionally independent given the class**, so that \(P(X|C) = \prod_i P(x_i|C)\). This assumption rarely holds in real-world data, yet the classifier is often surprisingly effective.

Formula:  
\[
P(C|X) = \frac{P(X|C) \times P(C)}{P(X)}
\]  
Where:  
- \(P(C|X)\): Posterior probability of class C given data X  
- \(P(X|C)\): Likelihood of data given class  
- \(P(C)\): Prior probability of the class  
- \(P(X)\): Evidence, the overall probability of the data, which acts as a normalizing constant  

Despite its simplicity, Naïve Bayes works well in text classification, spam filtering, and sentiment analysis.
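The formula can be worked through on a toy spam filter. All probabilities below are made-up values for illustration, and the function name `posterior` is hypothetical.

```python
# Hypothetical priors and per-word likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.4},
}


def posterior(words):
    """Return P(class | words) for each class using Bayes' theorem."""
    joint = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            # Naive step: multiply per-word likelihoods as if independent.
            score *= likelihoods[cls][word]
        joint[cls] = score
    evidence = sum(joint.values())  # P(X), the normalizing constant
    return {cls: score / evidence for cls, score in joint.items()}


result = posterior(["free", "meeting"])
print(result)
```

For a message containing both words, the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.02 for spam and 0.6 × 0.05 × 0.4 = 0.012 for ham, so the posterior probability of spam is 0.02 / 0.032 = 0.625.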


### **Question 9:** Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

**Answer:**  
| Type | Suitable For | Key Idea | Example Use Case |
|------|---------------|-----------|------------------|
| **Gaussian NB** | Continuous data | Assumes each feature is normally (Gaussian) distributed within each class | Predicting diseases using continuous lab results |
| **Multinomial NB** | Count-based data | Works with discrete features like word counts | Text classification, spam filtering |
| **Bernoulli NB** | Binary features | Features are either 0 or 1 (present/absent) | Sentiment analysis or document classification based on word presence |

Each variant is optimized for different data types but follows the same core principle of Bayes’ theorem.
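The three variants differ only in how they model \(P(x_i|C)\). The sketch below writes out each likelihood in plain Python as an illustration; the function names and example parameters are assumptions, and real implementations such as scikit-learn's also add smoothing and work in log space.

```python
from math import exp, isclose, log, pi, sqrt


def gaussian_likelihood(x, mean, var):
    """Gaussian NB: density of a continuous feature value under N(mean, var)."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def multinomial_log_likelihood(counts, word_probs):
    """Multinomial NB: log-likelihood of word counts, omitting the
    multinomial coefficient (it is the same for every class)."""
    return sum(n * log(p) for n, p in zip(counts, word_probs))


def bernoulli_likelihood(present, probs):
    """Bernoulli NB: each feature contributes p if present, else 1 - p."""
    result = 1.0
    for x, p in zip(present, probs):
        result *= p if x else (1 - p)
    return result


g = gaussian_likelihood(0.0, mean=0.0, var=1.0)
m = multinomial_log_likelihood([2, 0, 1], [0.5, 0.3, 0.2])
b = bernoulli_likelihood([1, 0, 1], [0.5, 0.3, 0.2])
print(g, m, b)
```

Note that Bernoulli NB explicitly penalizes absent features through the \(1 - p\) term, whereas Multinomial NB simply ignores words with a count of zero.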


### **Question 10:** Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.


**Answer:**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer data and hold out 30% of it for testing
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

# Train a Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Evaluate accuracy on the held-out test set
y_pred = gnb.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy of Gaussian Naive Bayes:", round(acc, 3))
```

**Output:**

```
Accuracy of Gaussian Naive Bayes: 0.942
```
