# SVM & Naïve Bayes Assignment
---

## Question 1: What is Information Gain, and how is it used in Decision Trees?

### **Answer:**

Information Gain is an important concept used in **Decision Tree algorithms** to decide **which feature should be selected for splitting the data at each node**. It is based on the idea of **reducing uncertainty** in the dataset. In decision trees, uncertainty is measured using **entropy**, and Information Gain tells us **how much entropy is reduced after a dataset is split on a particular feature**.

Entropy is a measure of randomness or impurity in a dataset. If all data points belong to the same class, entropy is zero, meaning there is no uncertainty. However, if the data points are evenly distributed among different classes, entropy is high. Information Gain is calculated as the **difference between the entropy before the split and the weighted entropy after the split**.

Mathematically, entropy is defined as:

$$
\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2(p_i)
$$

where $p_i$ is the probability of class $i$ in dataset $S$.

Information Gain is calculated using the formula:

$$
IG(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v)
$$

where $A$ is the feature used for splitting and $S_v$ represents the subsets formed after the split, one for each value $v$ of $A$.

In a Decision Tree, Information Gain is used during the **tree construction process**. At each node, the algorithm calculates the Information Gain for all possible features and selects the feature with the **highest Information Gain** for splitting. This ensures that the chosen feature provides the **maximum reduction in uncertainty**, leading to purer child nodes and a more effective tree structure.
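The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library implements it; the helper names `entropy` and `information_gain` and the toy weather-style data are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Parent entropy minus the weighted entropy of the child subsets
    produced by splitting on the feature at `feature_index`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: each row is (outlook, windy); the label is "play?"
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]
feature_names = ["outlook", "windy"]

# Score every candidate feature and keep the one with the highest gain
gains = {name: information_gain(rows, labels, i) for i, name in enumerate(feature_names)}
best_feature = max(gains, key=gains.get)

print(gains)
print("Split on:", best_feature)
```

In this toy data `outlook` separates the classes perfectly while `windy` leaves both subsets as mixed as the parent, so `outlook` is chosen.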

The main advantage of using Information Gain is that it helps build **compact and meaningful decision trees** by choosing features that best separate the classes. However, one limitation of Information Gain is that it tends to favor features with **many distinct values**, which can sometimes lead to overfitting. To address this issue, alternative measures such as **Gain Ratio** are used.
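The bias toward many-valued features can be seen on a small example. The sketch below assumes the usual C4.5-style definition of Gain Ratio (Information Gain divided by the entropy of the feature's own value distribution); the data and the names `row_id` and `weather` are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def info_gain(feature, labels):
    """Reduction in label entropy after grouping rows by feature value."""
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def gain_ratio(feature, labels):
    """Information Gain normalized by the split information of the feature."""
    split_info = entropy(feature)  # entropy of the feature's value distribution
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
row_id = list(range(8))  # unique value per row: useless for prediction
weather = ["sun", "sun", "sun", "rain", "rain", "rain", "sun", "sun"]

for name, feature in [("row_id", row_id), ("weather", weather)]:
    print(f"{name}: IG={info_gain(feature, labels):.3f}  GainRatio={gain_ratio(feature, labels):.3f}")
```

The unique identifier gets the highest Information Gain because every subset contains a single row, yet its Gain Ratio is lower than that of `weather` once the large split information is taken into account.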

In conclusion, Information Gain is a fundamental criterion in Decision Trees that measures the **effectiveness of a feature in classifying data**. By selecting the split that maximizes Information Gain at each node, the tree separates the classes in fewer splits, which tends to keep it compact and easy to interpret.

---

## Question 2: What is the difference between Gini Impurity and Entropy? Hint: Directly compare the two main impurity measures, highlighting strengths, weaknesses, and appropriate use cases.

### **Answer:**

Gini Impurity and Entropy are two commonly used **impurity measures** in **Decision Tree algorithms** to evaluate the quality of a split. Both measures quantify how **mixed or impure** a dataset is, and they help the decision tree decide **which feature should be chosen at each node**. Although they serve the same purpose, they differ in calculation, interpretation, and practical use.

**Entropy** is derived from **information theory** and measures the amount of **uncertainty or randomness** in a dataset. If all instances in a node belong to a single class, entropy is zero, indicating complete purity. As class distribution becomes more uniform, entropy increases. Entropy is commonly used in the **ID3 and C4.5 algorithms** and focuses on maximizing **Information Gain** during splitting.

On the other hand, **Gini Impurity** measures the probability of **incorrectly classifying a randomly chosen data point** if it were labeled at random according to the class distribution in that node. Like entropy, Gini impurity is zero when all data points belong to one class, and it increases as the class distribution becomes more mixed. Gini impurity is widely used in the **CART (Classification and Regression Trees)** algorithm.

From a computational perspective, **Gini impurity is slightly cheaper to calculate** because it needs only squares and sums, whereas entropy requires a logarithm for every class. For this reason, Gini impurity is often preferred for large datasets and real-time applications.

In terms of behavior, it is often said that **entropy tends to create more balanced splits**, while **Gini impurity tends to isolate the most frequent class in its own branch**. In practice, however, both measures usually produce **very similar trees**, and the difference in performance is usually minimal.

In conclusion, both Gini impurity and entropy are effective impurity measures for decision trees. Entropy is more theoretically grounded in information theory, while Gini impurity is computationally efficient. The choice between them depends on **algorithm preference, dataset size, and computational constraints** rather than major differences in predictive performance.

### **Comparison Table:**

| Aspect             | Gini Impurity                 | Entropy                   |
| ------------------ | ----------------------------- | ------------------------- |
| Concept            | Misclassification probability | Measure of uncertainty    |
| Formula            | $1 - \sum_i p_i^2$            | $-\sum_i p_i \log_2 p_i$  |
| Value Range        | 0 to 0.5 (binary)             | 0 to 1 (binary)           |
| Computational Cost | Lower (no logs)               | Higher (log calculations) |
| Algorithms         | CART                          | ID3, C4.5                 |
| Split Behavior     | Favors dominant class         | Produces balanced splits  |
| Practical Use      | Faster, large datasets        | More theoretical clarity  |
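
The formulas and value ranges in the table can be checked numerically. The snippet below is a small sketch for the binary case only; the function names `gini` and `entropy` are chosen for the example.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Sweep the share of the positive class from a pure node to a 50/50 node
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at the 50/50 split, where Gini impurity reaches 0.5 and entropy reaches 1.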

---


## Question 3: What is Pre-Pruning in Decision Trees?

### **Answer:**

Pre-pruning in Decision Trees is a technique used to stop the growth of a decision tree early in order to prevent overfitting. Instead of allowing the tree to grow fully, pre-pruning applies certain stopping criteria during tree construction so that unnecessary or insignificant splits are avoided.

In pre-pruning, the tree stops splitting a node when conditions such as maximum tree depth, minimum number of samples required to split a node, minimum information gain, or minimum samples in a leaf node are met. By restricting further splits, the model becomes simpler, faster, and more generalizable to unseen data.

In summary, pre-pruning helps control model complexity and reduces overfitting, but if applied too aggressively, it may lead to underfitting.
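
In scikit-learn, these stopping criteria correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one; the dataset and the specific threshold values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: each parameter is one of the stopping criteria
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                  # maximum tree depth
    min_samples_split=20,         # minimum samples required to split a node
    min_samples_leaf=5,           # minimum samples in a leaf node
    min_impurity_decrease=0.001,  # minimum impurity reduction for a split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.4f}")
```

The pre-pruned tree is guaranteed to be no deeper than `max_depth`; whether its test accuracy is higher or lower than the full tree's depends on the data and on the thresholds chosen.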

---

## Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical). Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_. (Include your Python code and output in the code box below.)

```python
# Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Step 2: Load dataset (Iris dataset as example)
iris = load_iris()
X = iris.data   # Features
y = iris.target # Labels

# Step 3: Create Decision Tree Classifier using Gini Impurity
dtree = DecisionTreeClassifier(criterion='gini', random_state=42)

# Step 4: Train the model
dtree.fit(X, y)

# Step 5: Print feature importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, dtree.feature_importances_):
    print(f"{feature}: {importance:.4f}")
```

Output:

```text
Feature Importances:
sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226
```


## Question 5: What is a Support Vector Machine (SVM)?

### **Answer:**

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression. It works by finding the boundary, or hyperplane, that best separates data points of different classes. The support vectors are the training points closest to this boundary, and SVM chooses the hyperplane that maximizes the margin between them, which tends to improve generalization to unseen data. If the data is not linearly separable, SVM can use kernel functions to work in a higher-dimensional space where a separation is possible. SVM is especially useful for high-dimensional and complex datasets because the boundary depends only on the support vectors, not on every training point.
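
The role of the support vectors and the margin can be inspected on a tiny example. The sketch below uses scikit-learn's `SVC` with a linear kernel on six hypothetical 2-D points; the large `C` value is an assumption made here to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (close to a hard margin)
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Predictions:", clf.predict([[1.5, 1.5], [4.5, 4.5]]))
```

Only the points nearest the boundary appear in `support_vectors_`; moving any of the other points slightly would leave the fitted hyperplane unchanged.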

---

## Question 6: What is the Kernel Trick in SVM?

### **Answer:**

The Kernel Trick is a technique used in Support Vector Machines (SVM) to handle data that is not linearly separable. Sometimes, you cannot draw a straight line (or hyperplane) to separate two classes in the original feature space. The kernel trick lets the SVM behave as if the data had been mapped into a higher-dimensional space where the classes can be separated linearly, without explicitly computing that mapping for every point: a kernel function returns the inner product of two points in the higher-dimensional space directly from their original coordinates.
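
The idea that the mapping never has to be computed explicitly can be checked on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D vectors, for which an explicit feature map is known; the function names `phi` and `poly_kernel` are chosen for the example.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: 2-D input -> 3-D feature space."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = float(np.dot(phi(a), phi(b)))  # inner product computed in 3-D
implicit = poly_kernel(a, b)              # same value computed in 2-D

print(explicit, implicit)
```

Both routes give the same number, but the kernel only ever touches the original 2-D coordinates. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the kernel route is the only practical one.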

**Key Points:**

* It allows SVM to learn non-linear boundaries without ever computing coordinates in the higher-dimensional space.
* Common kernels include:
  * Linear Kernel – for linearly separable data
  * Polynomial Kernel – for curved boundaries
  * RBF (Radial Basis Function) Kernel – for complex patterns
* The trick helps SVM work efficiently on non-linear data.

**Example:**
Imagine red and blue dots forming a circle inside another circle. In 2D, you cannot separate them with a straight line. Using the kernel trick, SVM can project the points into 3D, where a plane can separate the classes perfectly.
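
The circle-inside-a-circle picture can be reproduced with scikit-learn's `make_circles`. The sketch below is one way to illustrate it, under the assumption that the squared distance from the origin is used as the third coordinate; it also fits an RBF-kernel SVM on the original 2-D data for comparison.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=42)

# 1) Linear SVM in the original 2-D space
linear_2d = SVC(kernel="linear").fit(X, y)
acc_linear_2d = linear_2d.score(X, y)

# 2) Linear SVM after explicitly lifting to 3-D with z = x^2 + y^2
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_3d = SVC(kernel="linear").fit(X_lifted, y)
acc_linear_3d = linear_3d.score(X_lifted, y)

# 3) RBF-kernel SVM on the original 2-D data (no explicit lifting)
rbf_2d = SVC(kernel="rbf", gamma="scale").fit(X, y)
acc_rbf_2d = rbf_2d.score(X, y)

print(f"Linear, 2-D:        {acc_linear_2d:.3f}")
print(f"Linear, lifted 3-D: {acc_linear_3d:.3f}")
print(f"RBF kernel, 2-D:    {acc_rbf_2d:.3f}")
```

The accuracies here are measured on the training data, since the goal is only to show separability, not to estimate generalization.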

---

## Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.
Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.
(Include your Python code and output in the code box below.)

```python
# Step 1: Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Step 3: Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Create SVM classifiers
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)

# Step 5: Train the models
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Step 6: Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Step 7: Calculate and compare accuracies
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Accuracy of Linear Kernel SVM: {accuracy_linear:.4f}")
print(f"Accuracy of RBF Kernel SVM: {accuracy_rbf:.4f}")
```

Output:

```text
Accuracy of Linear Kernel SVM: 0.9815
Accuracy of RBF Kernel SVM: 0.7593
```
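
A likely reason the RBF kernel trails the linear kernel here is that the Wine features are on very different scales (proline values run into the hundreds or thousands, while hue is close to 1), and the RBF kernel depends on raw distances between points. The sketch below is an optional extension, not part of the question: it repeats the RBF experiment with a `StandardScaler` in a pipeline, using the same split as above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# RBF SVM on the raw, unscaled features
rbf_raw = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Same model, but features are standardized first; the scaler is fitted
# on the training split only, because it is part of the pipeline
rbf_scaled = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
).fit(X_train, y_train)

acc_raw = rbf_raw.score(X_test, y_test)
acc_scaled = rbf_scaled.score(X_test, y_test)

print(f"RBF without scaling: {acc_raw:.4f}")
print(f"RBF with scaling:    {acc_scaled:.4f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, so the comparison between the two models stays fair.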


## Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

### **Answer:**

The **Naïve Bayes classifier** is a **probabilistic machine learning algorithm** used for classification tasks. It is based on **Bayes’ Theorem**, which calculates the probability of a class given certain features.

It is called **“Naïve”** because it **assumes that all features are conditionally independent of each other given the class**, even though in real data they are often related. This simplification makes the algorithm fast to train and apply, even on **large datasets**, in tasks such as spam detection, text classification, and sentiment analysis.

**In simple words:**
Imagine trying to guess if an email is spam based on words it contains. Naïve Bayes assumes each word contributes **independently** to the chance of being spam, ignoring any relationship between words.
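
The spam example can be worked through by hand. In the sketch below, all priors and word likelihoods are made-up numbers for illustration; the function `posterior` simply multiplies the prior by each word's likelihood (the independence assumption) and normalizes.

```python
# Hypothetical probabilities, chosen only to illustrate the calculation
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "meeting": 0.05},  # P(word | spam)
    "ham":  {"free": 0.03, "meeting": 0.20},  # P(word | ham)
}

def posterior(words):
    """P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]  # each word contributes independently
        scores[label] = score
    total = sum(scores.values())              # normalize so the posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "meeting"])
print(result)
print(posterior(["free"]))
```

With these numbers, an email containing only "free" is most likely spam, and adding "meeting" pulls the spam probability down because that word is more common in non-spam email.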

---

## Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

### **Answer:**

**Differences between Gaussian, Multinomial, and Bernoulli Naïve Bayes:**

| Variant                     | Data Type                 | Key Idea                                                      | Typical Use Case                                                                  |
| --------------------------- | ------------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **Gaussian Naïve Bayes**    | Continuous numerical data | Assumes **features follow a Gaussian (normal) distribution**  | Predicting outcomes with real-valued data, e.g., height, weight, temperature      |
| **Multinomial Naïve Bayes** | Discrete count data       | Uses **feature counts** (how many times a feature occurs)     | Text classification, document classification, spam detection (word counts)        |
| **Bernoulli Naïve Bayes**   | Binary/Boolean data       | Considers **whether a feature is present or absent** (0 or 1) | Text classification with **binary features**, e.g., whether a word appears or not |

**Summary in simple words:**

* **Gaussian NB** → Good for **numbers** (continuous features).
* **Multinomial NB** → Good for **counts** (how many times something occurs).
* **Bernoulli NB** → Good for **yes/no data** (presence or absence of something).
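
The three variants map onto three scikit-learn classes. The sketch below fits each one on a tiny hand-made dataset of the matching type; all values, and the word labels for the count columns, are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous features (hypothetical height in cm, weight in kg)
X_cont = np.array([[150.0, 50.0], [155.0, 55.0], [160.0, 58.0],
                   [180.0, 80.0], [185.0, 85.0], [190.0, 90.0]])
y_cont = np.array([0, 0, 0, 1, 1, 1])
gnb = GaussianNB().fit(X_cont, y_cont)

# Multinomial NB: word counts; columns are "free", "win", "meeting", "report"
X_counts = np.array([[3, 2, 0, 0], [2, 1, 0, 0], [4, 3, 0, 1],
                     [0, 0, 2, 3], [0, 0, 3, 1], [1, 0, 2, 2]])
y_text = np.array([1, 1, 1, 0, 0, 0])  # 1 = spam, 0 = not spam
mnb = MultinomialNB().fit(X_counts, y_text)

# Bernoulli NB: the same documents reduced to word presence (1) or absence (0)
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y_text)

print("GaussianNB:   ", gnb.predict([[158.0, 56.0]]))
print("MultinomialNB:", mnb.predict([[2, 2, 0, 0]]))
print("BernoulliNB:  ", bnb.predict([[0, 0, 1, 1]]))
```

Note that Bernoulli NB also uses the absence of a word as evidence, whereas Multinomial NB only accumulates evidence from the words that occur.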

---


## Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.
(Include your Python code and output in the code box below.)

```python
# Step 1: Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 2: Load Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Step 3: Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Create Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Step 5: Train the model
gnb.fit(X_train, y_train)

# Step 6: Make predictions
y_pred = gnb.predict(X_test)

# Step 7: Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Gaussian Naive Bayes on Breast Cancer dataset: {accuracy:.4f}")
```

Output:

```text
Accuracy of Gaussian Naive Bayes on Breast Cancer dataset: 0.9415
```
