## Question 1: What is Information Gain, and how is it used in Decision Trees?

Answer : Information Gain (IG) is a core concept used in training Decision Trees. It is a metric that measures how much "purity" or order is gained by splitting a set of data based on a specific feature.
1. Definition
- Information: In this context, information refers to the reduction in uncertainty or entropy. A dataset with high randomness (e.g., a mix of 50% "Yes" and 50% "No" labels) has high entropy.
- Information Gain: IG is the difference between the entropy (impurity) of the data before the split and the weighted average entropy of the two or more subsets produced by the split:

$$\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{i=1}^{k} \frac{\text{Number of Samples in Child } i}{\text{Total Samples}} \times \text{Entropy}(\text{Child } i)$$

2. Usage in Decision Trees

Information Gain is the splitting criterion used by ID3; C4.5 uses a normalized variant called gain ratio.
- Feature Selection: At every node in the tree, the algorithm tests all available features for the best possible split.
- Optimal Split: The feature that results in the highest Information Gain is chosen as the splitting condition for that node. The goal is always to maximize IG, which means finding the split that leads to the purest child nodes (nodes where almost all samples belong to one class).
- Tree Growth: This process is repeated recursively on the resulting child nodes until no further gain can be made, or a stopping criterion is met.
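
The formula above can be worked through on a small example. The following is a minimal sketch, not a library implementation: the helper names `entropy` and `information_gain`, the toy labels, and the two candidate splits are all hypothetical, and the base-2 logarithm is an assumed (though common) choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy data: 10 samples, 5 "Yes" and 5 "No" (maximum entropy).
parent = ["Yes"] * 5 + ["No"] * 5

# Candidate split A separates the classes fairly well; split B barely helps.
split_a = [["Yes"] * 4 + ["No"] * 1, ["Yes"] * 1 + ["No"] * 4]
split_b = [["Yes"] * 3 + ["No"] * 2, ["Yes"] * 2 + ["No"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)

print(f"Parent entropy:  {entropy(parent):.4f}")
print(f"Gain of split A: {gain_a:.4f}")
print(f"Gain of split B: {gain_b:.4f}")
```

A tree builder following the procedure above would pick split A here, because it yields the larger gain.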

## Question 2: What is the difference between Gini Impurity and Entropy?

Answer : Gini Impurity vs. Entropy

| Aspect | Gini Impurity | Entropy |
|-----------|-----------|-----------|
| Concept | Probability of misclassifying a randomly chosen element if it were labeled at random according to the node's class distribution: $1 - \sum_i p_i^2$. | Uncertainty (randomness) of the class distribution, from information theory: $-\sum_i p_i \log_2 p_i$. |
| Range | 0 to 0.5 for two classes (0 means pure, 0.5 means a 50/50 mix). | 0 to 1 for two classes with a base-2 logarithm (0 means pure, 1 means a 50/50 mix). |
| Computation | Squares the class probabilities. Slightly cheaper because it avoids logarithms. | Requires logarithms. Slightly more expensive than Gini. |
| Goal of Split | Choose the split with the lowest weighted Gini Impurity across the child nodes. | Choose the split with the highest Information Gain (greatest reduction in Entropy). |
| Common Use | Default criterion in CART (Classification and Regression Trees), the algorithm on which scikit-learn's decision trees are based. | Used in ID3 and C4.5; also available in scikit-learn via `criterion='entropy'`. |
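
The ranges in the table can be checked numerically. This is a small sketch with hypothetical helper names (`gini`, `entropy`); it evaluates both measures on a few two-class distributions and assumes a base-2 logarithm for entropy.

```python
import math


def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Sweep the share of the first class in a two-class node.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.4f}  entropy={entropy(dist):.4f}")
```

Both measures are zero for a pure node and peak at the 50/50 mix, which is why they usually rank candidate splits similarly.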

## Question 3: What is Pre-Pruning in Decision Trees?

Answer : Pre-Pruning (or early stopping) is a technique used to prevent a Decision Tree from growing too large and complex, which helps to avoid the problem of overfitting to the training data.

#### How Pre-Pruning Works

Instead of building a full tree and then cutting it back (which is Post-Pruning), Pre-Pruning stops the tree building process early by defining strict rules before or during the splitting process at each node.

The algorithm checks these rules at a node before attempting a split:

1. Maximum Depth: The tree is not allowed to grow beyond a set number of levels (e.g., maximum depth of 5).

2. Minimum Samples Per Split: A node will only be split if it contains a minimum required number of data points (e.g., only split a node if it has at least 20 samples).

3. Minimum Impurity Decrease: A split is only performed if it results in a reduction of impurity (like Gini Impurity or Entropy) that is greater than a specified threshold. If the gain from the split is too small, the split is rejected, and the node remains a leaf node.
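
In scikit-learn, the three rules above correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset; the threshold values (3, 20, 0.01) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline: no pre-pruning, the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: illustrative (assumed) thresholds for the three rules.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # rule 1: maximum depth
    min_samples_split=20,        # rule 2: minimum samples needed to split a node
    min_impurity_decrease=0.01,  # rule 3: minimum impurity reduction per split
    random_state=42,
).fit(X_train, y_train)

for name, tree in (("Unrestricted", full_tree), ("Pre-pruned", pruned_tree)):
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

In practice these thresholds are usually chosen by cross-validation rather than fixed by hand.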

## Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load Data
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Initialize and Train the Model
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_classifier.fit(X_train, y_train)

# 4. Predict and Evaluate
y_pred = dt_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5. Get and Print Feature Importances
importances = dt_classifier.feature_importances_
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

print("--- Decision Tree Results ---")
print(f"Test Accuracy: {accuracy:.4f}\n")
print("Feature Importances (Gini):")
print(importance_df.to_string(index=False))
```

Output:

```
--- Decision Tree Results ---
Test Accuracy: 1.0000

Feature Importances (Gini):
          Feature  Importance
petal length (cm)    0.893264
 petal width (cm)    0.087626
 sepal width (cm)    0.019110
sepal length (cm)    0.000000
```


## Question 5: What is a Support Vector Machine (SVM)?

Answer : A Support Vector Machine (SVM) is a powerful and versatile machine learning algorithm used for both classification and regression tasks. Its strength lies in its ability to handle complex, high-dimensional data.

#### Core Concept: The Hyperplane
For classification, the primary goal of an SVM is to find the best possible decision boundary, which is called a hyperplane, that separates the data points of different classes.

- In 2D data, the hyperplane is simply a line.

- In 3D data, the hyperplane is a flat plane.

- In higher dimensions (n features), it is a flat subspace of dimension n - 1.

#### The "Optimal" Hyperplane
The "best" hyperplane is the one that has the largest margin.

- Margin: The distance between the hyperplane and the closest data points from either class.

- Support Vectors: The data points that lie closest to the hyperplane and determine the position and orientation of the margin are called the Support Vectors.

By maximizing this margin, the SVM aims for the best possible separation between the classes, which generally leads to better generalization (less chance of misclassifying unseen data).
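
The margin and support vectors can be inspected directly on a fitted linear SVM. The sketch below uses a hypothetical, linearly separable 2D toy dataset and a large `C` to approximate a hard margin; the width of the margin band is computed as 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D toy data (two small clusters).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C (an assumed value) approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]  # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Hyperplane normal w:", w)
print("Intercept b:", b)
print(f"Margin width (2 / ||w||): {margin_width:.4f}")
print("Support vectors:")
print(clf.support_vectors_)
```

Only the points listed in `support_vectors_` influence the boundary; moving any other point slightly, without crossing the margin, leaves the hyperplane unchanged.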

## Question 6: What is the Kernel Trick in SVM?

Answer : The Kernel Trick is one of the most powerful concepts that makes Support Vector Machines so effective, especially with non-linearly separable data (data that cannot be separated by a single straight line).

#### The Problem

If the data is non-linear (e.g., data points of one class form a circle around data points of another class), no simple hyperplane (line or plane) in the original low-dimensional space can separate them accurately.

#### The Solution: Mapping to Higher Dimensions

1. Implicit Transformation: The Kernel Trick uses a kernel function (like the Radial Basis Function, RBF) to compute the inner product of two data points, which acts as a similarity measure, as if they had been mapped to a much higher-dimensional feature space, without ever explicitly computing the transformation itself.

2. Linear Separation: In this higher-dimensional space, the data points that were tangled up in the low-dimensional space often become linearly separable.

3. Efficiency: This "trick" allows the SVM to find a linear boundary (a hyperplane) in the high-dimensional space, which corresponds to a complex, non-linear decision boundary when projected back into the original low-dimensional space. This provides a non-linear classifier without the computational complexity of explicitly working with massive, high-dimensional data.

The most common kernel functions are Linear, Polynomial, and Radial Basis Function (RBF).
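
The "implicit" part of the trick can be verified by hand for a small polynomial kernel. The sketch below assumes the homogeneous degree-2 kernel K(a, b) = (a · b)² on 2D points, whose explicit feature map is (x1², √2·x1·x2, x2²); the function names `phi` and `poly_kernel` are hypothetical. Both routes give the same number, but the kernel never builds the 3D vectors.

```python
import numpy as np


def phi(v):
    """Explicit degree-2 feature map for a 2D point."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel: (a . b)^2."""
    return np.dot(a, b) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # map to 3D first, then take the inner product
implicit = poly_kernel(x, z)       # stay in 2D, apply the kernel directly

print(f"Explicit mapping: {explicit:.4f}")
print(f"Kernel shortcut:  {implicit:.4f}")
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and the kernel shortcut is the only practical option.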

## Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# 1. Load and Prepare Data
wine = load_wine()
X = wine.data
y = wine.target

# Standardize the data (important for SVM, which is sensitive to feature scale).
# Note: the scaler is fit on the full dataset before splitting, so test-set
# statistics leak into preprocessing; fitting it on X_train only would give a
# stricter evaluation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# 2. Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
acc_linear = accuracy_score(y_test, y_pred_linear)

# 3. Train SVM with RBF (Radial Basis Function) Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# 4. Compare Accuracies
print("--- SVM Kernel Comparison (Wine Dataset) ---")
print(f"Accuracy with Linear Kernel: {acc_linear:.4f}")
print(f"Accuracy with RBF Kernel:    {acc_rbf:.4f}")
```

Output:

```
--- SVM Kernel Comparison (Wine Dataset) ---
Accuracy with Linear Kernel: 0.9815
Accuracy with RBF Kernel:    0.9815
```


## Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

#### Answer : What is the Naïve Bayes Classifier?

The Naïve Bayes classifier is a simple yet effective classification algorithm based on Bayes' Theorem from probability theory. It's primarily used for tasks like text classification (e.g., spam filtering, sentiment analysis) and disease prediction.
- Bayes' Theorem: This theorem allows the algorithm to calculate the probability of a specific class (e.g., "Spam") given a set of features (e.g., the words in the email).

$$\text{P}(\text{Class}|\text{Features}) = \frac{\text{P}(\text{Features}|\text{Class}) \times \text{P}(\text{Class})}{\text{P}(\text{Features})}$$

The classifier predicts the class that has the highest posterior probability, $\text{P}(\text{Class}|\text{Features})$.

#### Why is it called "Naïve"?
The term "Naïve" comes from the major simplifying assumption the model makes, which is often an unrealistically strong assumption in real-world data:

- Assumption of Conditional Independence: The model assumes that all the features ($X_1, X_2, X_3, \dots$) used to predict the class are conditionally independent of one another given the class.
- Example: In spam filtering, the Naïve Bayes model assumes that the probability of the word "buy" appearing in a spam email is independent of the probability of the word "now" appearing in the same spam email. In reality, these words often appear together, meaning they are dependent.

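Despite this "naïve" and often incorrect assumption, the model performs surprisingly well in many real-world scenarios. The assumption reduces the joint likelihood to a simple product of per-feature probabilities, which is cheap to estimate, and classification only requires the correct class to receive the highest score, not that the probabilities themselves be accurate.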
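
The spam example can be worked through numerically. In the sketch below, every probability is a made-up value for illustration, and `posterior` is a hypothetical helper; it multiplies the prior by one likelihood per word (the naïve independence assumption) and then normalizes by the evidence, as in Bayes' Theorem above.

```python
# Hypothetical class priors and per-word likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"buy": 0.30, "now": 0.20},
    "ham": {"buy": 0.05, "now": 0.10},
}


def posterior(words):
    """Posterior P(class | words) under the conditional independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Naive step: treat each word as independent given the class.
            score *= likelihoods[label][word]
        scores[label] = score
    evidence = sum(scores.values())  # P(words), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}


result = posterior(["buy", "now"])
prediction = max(result, key=result.get)

for label, prob in result.items():
    print(f"P({label} | 'buy', 'now') = {prob:.4f}")
print("Predicted class:", prediction)
```

If "buy" and "now" actually tend to co-occur, the product overstates the evidence, but the ranking of the classes is often still correct.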

## Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

Answer :

| Variation | Data Type / Feature Distribution | Common Applications |
|-----------|-----------|-----------|
| Gaussian Naïve Bayes | Features are continuous numerical values (e.g., height, weight, cholesterol levels), assumed to follow a normal (Gaussian) distribution (bell curve) within each class. | Classification problems with continuous, approximately normally distributed features, such as predicting gender or disease presence. |
| Multinomial Naïve Bayes | Features are discrete counts or frequencies (e.g., how many times a word appears in a document). | Widely used in Text Classification (e.g., spam filtering, document categorization), where the features are word count vectors. |
| Bernoulli Naïve Bayes | Features are binary or Boolean (0 or 1, Yes or No, True or False). | Often used in Text Classification where the model only cares whether a word is present (1) or absent (0) in a document, regardless of how many times it appears. |
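
The three variants differ mainly in the kind of feature matrix they expect. The sketch below fits each scikit-learn class (`GaussianNB`, `MultinomialNB`, `BernoulliNB`) on a tiny, hypothetical dataset of the matching type; the numbers are made up purely to show the input shapes.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # hypothetical labels shared by all three examples

# Gaussian NB: continuous measurements.
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
gnb = GaussianNB().fit(X_continuous, y)

# Multinomial NB: non-negative counts (e.g., word counts per document).
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Bernoulli NB: binary presence/absence, derived here from the counts.
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y)

pred_g = gnb.predict([[1.1, 2.0]])[0]
pred_m = mnb.predict([[2, 0, 1]])[0]
pred_b = bnb.predict([[1, 0, 1]])[0]

print("GaussianNB prediction:   ", pred_g)
print("MultinomialNB prediction:", pred_m)
print("BernoulliNB prediction:  ", pred_b)
```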


## Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# 1. Load Data
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Standardize Data (optional here: GaussianNB models each feature
#    separately, so rescaling has little effect on its predictions)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Initialize and Train the Model
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train_scaled, y_train)

# 5. Predict and Evaluate
y_pred = gnb_classifier.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

# 6. Print Results
print("--- Gaussian Naïve Bayes (Breast Cancer Dataset) ---")
print(f"Number of Test Samples: {len(X_test)}")
print(f"Accuracy Score: {accuracy:.4f}")
```

Output:

```
--- Gaussian Naïve Bayes (Breast Cancer Dataset) ---
Number of Test Samples: 171
Accuracy Score: 0.9357
```
