Question 1: What is Information Gain, and how is it used in Decision Trees?


Information Gain is a metric used in Decision Trees to measure how much a feature helps in reducing uncertainty or disorder (entropy) in the target variable when we split the data based on that feature. It quantifies the effectiveness of a feature in classifying the training data by calculating the reduction in entropy after the split.

Mathematically, it is defined as:

$$\mathrm{IG}(D, A) = H(D) - H(D \mid A)$$

where

	H(D) is the entropy of the dataset before the split,

	H(D∣A) is the conditional entropy of the dataset after splitting on feature A, i.e., the weighted average of the entropies of the resulting subsets.

Entropy represents the impurity or variability in the dataset and is calculated as:

$$H(D) = -\sum_{i=1}^{n} p_i \log_2 p_i$$


where p_i is the proportion of samples in D that belong to class i, and n is the number of classes.

Information Gain thus captures how much a dataset's impurity decreases by splitting on a particular feature. The feature that results in the highest Information Gain is chosen for the split at each node of the Decision Tree, as it leads to nodes that are more homogeneous (i.e., containing mostly one class).

In summary, Information Gain is critical in Decision Trees because it guides the algorithm in selecting the best features for splitting, thereby creating efficient and accurate classification models.
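
The definitions above can be checked numerically with a short sketch. The helper names (`entropy`, `information_gain`) and the toy labels are hypothetical, chosen only for illustration; the code computes H(D) and subtracts the weighted entropy of the subsets produced by a categorical feature.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a negative zero
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """H(D) minus the weighted entropy of the subsets induced by the feature."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, feature_values):
        subsets.setdefault(value, []).append(label)
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional

# Hypothetical toy data: 4 "yes" and 4 "no" labels, so H(D) = 1 bit
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
perfect_feature = ["a", "a", "a", "a", "b", "b", "b", "b"]  # separates the classes fully
useless_feature = ["a", "b", "a", "b", "a", "b", "a", "b"]  # each subset stays 50/50

print(information_gain(labels, perfect_feature))  # 1.0
print(information_gain(labels, useless_feature))  # 0.0
```

A split that yields pure subsets recovers the full entropy of the parent node, while a split that leaves the class mix unchanged gains nothing, which is why the tree prefers the first feature.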


Question 2: What is the difference between Gini Impurity and Entropy?


Gini Impurity and Entropy are both metrics used in Decision Trees to measure the impurity or disorder within a set of data points, guiding how the tree splits the data. They share a similar goal but differ in calculation, interpretation, and behavior:

Gini Impurity

	Measures the probability of a randomly chosen element being misclassified if it were labeled according to the distribution of classes in the node.
	Calculated as $\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2$, where $p_i$ is the proportion of class $i$ and $C$ is the number of classes.

	Ranges from 0 (pure node) to 0.5 (maximum impurity for binary classification).

	Computationally simpler and faster since it does not use logarithms.
	Tends to isolate the most frequent class, sometimes favoring splits that focus on the dominant class.

	Often used in CART (Classification and Regression Tree) algorithms.

Entropy

	Measures the amount of uncertainty or disorder in the dataset.
	Calculated as $\mathrm{Entropy} = -\sum_{i=1}^{C} p_i \log_2 p_i$.

	Ranges from 0 (pure node) to 1 (maximum impurity for binary classification).
	Computationally more intensive due to logarithms.

	Tends to produce more balanced trees by focusing on reducing overall uncertainty.

	Commonly used in ID3 and C4.5 algorithms.

Main Differences

Feature                 Gini Impurity                        Entropy
Interpretation          Probability of misclassification     Average information needed to classify
Range (binary class)    0 to 0.5                             0 to 1
Computation             Simpler, no logarithms               Uses logarithmic operations
Effect on splits        Prefers dominant class separation    Prefers balanced splits
Used in algorithms      CART                                 ID3, C4.5
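
To make the two ranges concrete, the following sketch evaluates both measures on a few binary class distributions. The function names `gini` and `entropy` are illustrative helpers, not library calls.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero proportion contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Binary node where the first class has proportion p
for p in (0.0, 0.1, 0.3, 0.5):
    dist = [p, 1.0 - p]
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are zero for a pure node and peak at p = 0.5, where Gini reaches 0.5 and entropy reaches 1.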



Question 3: What is Pre-Pruning in Decision Trees?

Pre-Pruning in Decision Trees, also known as early stopping, is a technique that stops the growth of the tree early during the training process to prevent it from becoming overly complex and overfitting the training data. Instead of allowing the tree to grow fully and then pruning back, pre-pruning sets constraints that halt the splitting of nodes when certain conditions are met.

Common pre-pruning conditions include:

Limiting the maximum depth of the tree.

Setting a minimum number of samples required to split a node.

Defining a minimum number of samples that must be present in a leaf node.

Restricting the maximum number of features considered for splitting.

By applying these constraints, pre-pruning results in a simpler, less complex tree that is less likely to overfit and generally quicker to train. However, it may risk underfitting if the growth is stopped too early, missing patterns in the data.

In summary, pre-pruning prevents the growth of the tree beyond a point where further splitting does not significantly improve model performance, aiming for a balanced trade-off between complexity and accuracy.
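
In scikit-learn, the conditions listed above correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below is one possible illustration; the specific values (depth 3, 10 samples to split, 5 per leaf, 2 features) are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Baseline: no constraints, the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: each argument is one of the early-stopping conditions
pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # limit the maximum depth of the tree
    min_samples_split=10,  # minimum samples required to split a node
    min_samples_leaf=5,    # minimum samples that must remain in a leaf
    max_features=2,        # features considered when searching for a split
    random_state=42,
).fit(X, y)

print("Unconstrained: depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned:    depth", pruned_tree.get_depth(), "leaves", pruned_tree.get_n_leaves())
```

Comparing depth and leaf count between the two trees shows how the constraints cap model complexity; suitable values are usually chosen by cross-validation.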

Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).


In [None]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train a Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X, y)

# Print the feature importances
print("Feature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

Output

Feature Importances:
sepal length (cm): 0.0123
sepal width (cm): 0.0000
petal length (cm): 0.5584
petal width (cm): 0.4293


Question 5: What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for classification and regression tasks. It works by finding the optimal decision boundary, called a hyperplane, that best separates different classes of data points in the feature space. The key objective of SVM is to maximize the margin—the distance between the hyperplane and the nearest data points of each class, known as support vectors. A larger margin generally leads to better generalization on unseen data.

SVM can handle both linearly separable and non-linearly separable data by utilizing kernel functions that transform data into higher-dimensional spaces where a linear separation becomes possible. This kernel trick enables SVMs to create complex decision boundaries without explicitly computing the coordinates in the higher-dimensional space.

There are two main types of SVMs:

Linear SVM, which uses a linear hyperplane to separate the data.

Non-linear SVM, which uses kernels such as polynomial, Gaussian (RBF), or sigmoid to handle more complex data patterns.

SVM also supports soft margins, allowing some misclassifications to improve the model’s ability to generalize when data is noisy or not perfectly separable.

In summary, an SVM is a powerful classification algorithm that finds the optimal separating hyperplane by maximizing the margin between classes, supporting linear and nonlinear data separation through the kernel trick.
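
As a small illustration of the hyperplane, support vectors, and margin described above, the following sketch fits a linear SVM on hypothetical 2-D toy data (the points and the value C=1.0 are assumptions made for the example) and inspects the fitted model through scikit-learn's `support_vectors_`, `coef_`, and `intercept_` attributes.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# C controls the soft margin: smaller C tolerates more margin violations
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, " b =", b)
print("Margin width:", 2.0 / np.linalg.norm(w))
print("Predictions:", clf.predict([[1.0, 2.0], [6.0, 6.0]]))
```

Only the support vectors determine the hyperplane; removing any other training point would leave the fitted boundary unchanged.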

Question 6: What is the Kernel Trick in SVM?


The Kernel Trick in Support Vector Machines (SVM) is a technique that allows SVMs to efficiently handle non-linearly separable data by implicitly mapping the input features into a higher-dimensional space where the data can be linearly separated. Instead of explicitly computing the coordinates of data points in this high-dimensional feature space—which would be computationally expensive—the kernel trick uses kernel functions to directly compute the inner products of the transformed data points in this space.

This means the SVM algorithm operates as if the data were transformed into a higher dimension to make it separable, but without ever performing the explicit transformation. Common kernel functions include:

	Linear kernel (no transformation, for already linearly separable data)

	Polynomial kernel (maps features into polynomial feature space)

	Radial Basis Function (RBF or Gaussian) kernel (handles complex decision boundaries)

	Sigmoid kernel (similar to neural networks)

Mathematically, the kernel trick replaces the inner product ϕ(x_i)⋅ϕ(x_j) in the higher-dimensional space with a kernel function K(x_i,x_j) computed in the original space.

This approach vastly expands the power of SVMs to classify data that are not linearly separable in the original feature space while keeping computation efficient. It enables SVMs to learn very complex boundaries without losing the advantages of the algorithm's theoretical foundation.

In summary, the Kernel Trick allows SVMs to solve non-linear classification problems by implicitly mapping features to higher dimensions through kernel functions, avoiding heavy computation and enabling flexible, powerful classification.
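
The identity ϕ(x_i)⋅ϕ(x_j) = K(x_i,x_j) can be verified directly for a simple case. The sketch below assumes the homogeneous degree-2 polynomial kernel K(x, z) = (x⋅z)² on 2-D inputs, whose explicit feature map is ϕ(x) = (x₁², √2·x₁x₂, x₂²); the function names and sample vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """Kernel computed in the original 2-D space: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the 3-D feature space
implicit = poly_kernel(x, z)              # same quantity without ever building phi

print("explicit:", explicit)
print("implicit:", implicit)
print("match:", np.isclose(explicit, implicit))
```

Both routes give the same value (up to floating-point rounding), but the kernel needs only one dot product in the original space. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.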


Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

In [None]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with Linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

# Print accuracies
print(f"SVM with Linear Kernel Accuracy: {accuracy_linear:.4f}")
print(f"SVM with RBF Kernel Accuracy: {accuracy_rbf:.4f}")

Output

SVM with Linear Kernel Accuracy: 0.9815
SVM with RBF Kernel Accuracy: 0.9074
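
The comparison above uses the raw, unscaled features. SVMs, and the RBF kernel in particular, are sensitive to feature scale, so a possible follow-up is to standardize the features before fitting. The sketch below is an assumed extension of the exercise, not part of the original question; it uses scikit-learn's `StandardScaler` and `make_pipeline` with the same split as above, and the `scaled_accuracy` dictionary is a name introduced here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaled_accuracy = {}
for kernel in ("linear", "rbf"):
    # The scaler is fit on the training split only, then applied to the test split
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, random_state=42))
    model.fit(X_train, y_train)
    scaled_accuracy[kernel] = model.score(X_test, y_test)
    print(f"Scaled SVM ({kernel}) accuracy: {scaled_accuracy[kernel]:.4f}")
```

Placing the scaler inside the pipeline keeps test-set statistics out of training, which avoids a subtle form of data leakage.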


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?


The Naïve Bayes classifier is a supervised machine learning algorithm that classifies data points based on probabilities derived from Bayes’ Theorem. It predicts the category of a data point by calculating the probability that it belongs to each class and then choosing the class with the highest probability. The classification is done by assuming that all features (predictors) are conditionally independent of each other given the class label.

It is called "Naïve" because this assumption of feature independence is highly simplistic and often unrealistic in real-world data where features can be correlated. Despite this "naïve" assumption, the classifier performs surprisingly well in many applications like spam filtering, text classification, and sentiment analysis due to its simplicity, efficiency, and relatively good accuracy.

In essence, the Naïve Bayes classifier combines:

Bayes’ theorem to compute the posterior probability of each class given the features.

The naive independence assumption, treating each feature’s contribution to the class probability as independent.

This simplicity allows it to train quickly and requires less data to estimate parameters compared to more complex classifiers, but it may produce less accurate probability estimates when feature dependencies are strong.

To summarize, Naïve Bayes is a probabilistic classifier named for its "naïve" assumption of feature independence, which enables efficient and effective classification despite the unrealistic simplification.
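
The calculation can be traced by hand on a tiny example. All numbers below (the priors and the per-word likelihoods) are made up for illustration; the sketch multiplies the class prior by each feature's likelihood, as the independence assumption permits, and then normalizes to obtain posteriors.

```python
# Hypothetical class priors P(class)
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical likelihoods P(word present | class)
likelihoods = {
    "spam": {"free": 0.8, "offer": 0.6},
    "ham": {"free": 0.1, "offer": 0.2},
}

observed = ["free", "offer"]  # words present in the message to classify

# Naive assumption: P(words | class) is the product of per-word likelihoods
unnormalized = {}
for cls, prior in priors.items():
    score = prior
    for word in observed:
        score *= likelihoods[cls][word]
    unnormalized[cls] = score

# Bayes' theorem: divide by the evidence so the posteriors sum to 1
evidence = sum(unnormalized.values())
posteriors = {cls: score / evidence for cls, score in unnormalized.items()}
prediction = max(posteriors, key=posteriors.get)

for cls, prob in posteriors.items():
    print(f"P({cls} | message) = {prob:.4f}")
print("Predicted class:", prediction)
```

Here the unnormalized scores are 0.4 × 0.8 × 0.6 = 0.192 for spam and 0.6 × 0.1 × 0.2 = 0.012 for ham, so the message is classified as spam.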

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.


Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes are three variants of the Naïve Bayes classifier, each suited for different types of data based on the underlying assumptions about feature distributions:

Gaussian Naïve Bayes assumes that continuous features follow a Gaussian (normal) distribution. It is commonly used when the data have continuous values like height, weight, or any measurements that can be modeled by bell-shaped curves. For example, it can be used for classifying iris flower species based on sepal and petal dimensions.

Multinomial Naïve Bayes is designed for discrete count data such as word frequencies in text classification. It models the features as multinomially distributed, which means it considers how many times each feature (e.g., a word) occurs in a document. This makes it suitable for tasks like spam detection or document categorization based on term counts.

Bernoulli Naïve Bayes works with binary/Boolean features indicating the presence or absence of an attribute (e.g., whether a word occurs or not in a document). It is useful when data are represented as binary vectors, for instance in text classification problems where features indicate whether a word appears in a document regardless of its frequency.
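
The three variants map to three scikit-learn classes, each fed the kind of feature it assumes. The sketch below uses tiny hypothetical datasets (continuous measurements, word counts, and the same counts binarized to presence or absence) purely to show which input type goes with which class.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # class labels shared by all three toy datasets

# Continuous measurements -> Gaussian Naive Bayes
X_continuous = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])

# Word counts per document -> Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word presence/absence (binarized counts) -> Bernoulli Naive Bayes
X_binary = (X_counts > 0).astype(int)

gnb = GaussianNB().fit(X_continuous, y)
mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_binary, y)

print("GaussianNB:   ", gnb.predict([[5.0, 3.4]]))
print("MultinomialNB:", mnb.predict([[2, 0, 1]]))
print("BernoulliNB:  ", bnb.predict([[0, 1, 1]]))
```

Unlike the multinomial model, the Bernoulli model also treats the absence of a word as evidence, which is why the two can disagree on the same document.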

Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

In [None]:
# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the results
print(f"Gaussian Naive Bayes Classifier Accuracy: {accuracy:.4f}")

Output
Gaussian Naive Bayes Classifier Accuracy: 0.9532
