##Answer 1

**Definition**

Information Gain is the reduction in entropy (uncertainty or impurity) achieved by splitting the dataset based on a feature.


**How Information Gain Works in Decision Trees**

1. Calculate the entropy of the entire dataset (before splitting).

2. For each feature:

* Split the data based on that feature’s possible values.

* Compute the entropy of each subset.

* Calculate the weighted average entropy after the split.

* Compute the Information Gain: the entropy before the split minus the weighted average entropy after it.

3. Choose the feature with the highest Information Gain to make the split — this feature gives the most “information” about the target.
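The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names `entropy` and `information_gain` and the toy `outlook`/`windy`/`play` data are assumptions made up for this example, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """Entropy before the split minus the weighted average entropy after it."""
    total = len(labels)
    # Group the labels by the value the feature takes (one subset per value)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy dataset: two candidate features and one target
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)

print(f"Gain(outlook) = {gain_outlook:.4f}")
print(f"Gain(windy)   = {gain_windy:.4f}")
```

In this toy data `outlook` separates the target perfectly while `windy` tells us nothing, so a tree using Information Gain would split on `outlook` first.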

##Answer 2

Here are the key differences between Gini Impurity and Entropy:

1. Gini Impurity:

* Uses squared class probabilities (it measures the probability of misclassifying a randomly chosen sample).

* Computation is faster.

* Often gives similar results to Entropy.

2. Entropy:

* Uses logarithms (information theory)

* Computation is slightly slower.

* When used through Information Gain, can favor features with many distinct values.
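To make the comparison concrete, here is a small sketch that evaluates both measures on the same class distributions, using the standard definitions Gini = 1 - Σ pᵢ² and Entropy = -Σ pᵢ log₂ pᵢ. The function names and the example probability lists are chosen for illustration only.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical class distributions for a binary problem
pure = [1.0, 0.0]
skewed = [0.9, 0.1]
balanced = [0.5, 0.5]

for name, probs in [("pure", pure), ("skewed", skewed), ("balanced", balanced)]:
    print(f"{name:9s} gini={gini(probs):.4f} entropy={entropy(probs):.4f}")
```

Both measures are zero for a pure node and largest for a balanced one, which is why they often lead to similar splits; Gini simply avoids the logarithm.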

##Answer 3

**What It Means**

When building a Decision Tree, the algorithm keeps splitting the data into smaller and smaller subsets to make predictions more accurate.
However, if it splits too much, the tree starts to memorize noise instead of learning general patterns — this is called overfitting.

* Pre-Pruning stops the tree from growing once certain conditions are met (for example, a maximum depth or a minimum number of samples per leaf), to keep it simpler and more generalizable.
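As a rough illustration, scikit-learn exposes pre-pruning through constructor parameters such as `max_depth` and `min_samples_leaf`. The specific values below (depth 2, at least 5 samples per leaf) and the choice of the Iris dataset are arbitrary assumptions for the sketch, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unrestricted tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops at depth 2 or when a leaf would get fewer than 5 samples
pruned_tree = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=42
).fit(X, y)

print("Full tree   -> depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("Pruned tree -> depth:", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())
```

The pre-pruned tree is smaller by construction; whether it generalizes better on a given problem would need to be checked on held-out data.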

##Answer 4

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a sample dataset (Iris dataset)
data = load_iris()
X = data.data          # Feature matrix
y = data.target        # Target labels

# Create a Decision Tree Classifier using Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train (fit) the model
clf.fit(X, y)

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

Feature Importances:
sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226


##Answer 5

SVM tries to find the **best boundary** (**hyperplane**) that separates data points of different classes with the **maximum margin** (the largest possible gap between the classes).

So, it’s all about finding a line (in 2D) or a plane/hyperplane (in higher dimensions) that clearly divides the classes.
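The idea of a maximum margin can be illustrated on a tiny, made-up 2D dataset. In this sketch the four points and the very large `C` value (used to approximate a hard margin) are assumptions for illustration; for a linear SVM the margin width is 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: class 0 on the line x = 0, class 1 on the line x = 2
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C leaves almost no room for margin violations
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                          # normal vector of the separating hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print("Weights:", w)
print("Intercept:", clf.intercept_[0])
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Here the widest possible gap between the classes is the distance between the two vertical lines, so the reported margin width should come out close to 2.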

##Answer 6

* **The Problem**

In a Linear SVM, the model finds a straight line (or hyperplane) to separate classes.
But what if the data can’t be separated by a straight line?

Example:
Imagine data shaped like two concentric circles — you can’t separate them with a line in 2D space.


* **The Idea Behind the Kernel Trick**

Instead of trying to separate the data in the original (low-dimensional) space,
SVM maps **the data to a higher-dimensional space**, where it becomes separable by a linear boundary.

This mapping is done implicitly using a Kernel Function, without actually computing the new coordinates — that’s the **“trick.”**


* **What Is a Kernel Function?**

A kernel function computes the dot product (similarity) between two data points in a higher-dimensional space, **without explicitly transforming** the data.
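A small numeric check can make this concrete. The sketch below uses the degree-2 homogeneous polynomial kernel K(a, b) = (a · b)² as one example of a kernel; the function names `phi` and `poly_kernel` and the two sample points are illustrative assumptions. It compares the kernel value with the dot product obtained after explicitly mapping 2D points into 3D.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D point (x1, x2) -> 3D."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2, computed in the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))   # map to 3D first, then take the dot product
implicit = poly_kernel(a, b)        # never leaves 2D

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both computations give the same value, but the kernel version never builds the higher-dimensional vectors, which is what makes the approach practical when the mapped space is very large.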

##Answer 7

# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data        # Features
y = data.target      # Labels

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create two SVM classifiers
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

# Train both classifiers
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Evaluate accuracy
acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print comparison results
print("SVM Classifier Comparison on Wine Dataset")
print("------------------------------------------")
print(f"Linear Kernel Accuracy: {acc_linear:.4f}")
print(f"RBF Kernel Accuracy:    {acc_rbf:.4f}")

# (Optional) Print which one performed better
if acc_linear > acc_rbf:
    print("\n✅ Linear Kernel performed better.")
elif acc_rbf > acc_linear:
    print("\n✅ RBF Kernel performed better.")
else:
    print("\n⚖️  Both kernels performed equally well.")

SVM Classifier Comparison on Wine Dataset
------------------------------------------
Linear Kernel Accuracy: 0.9444
RBF Kernel Accuracy:    0.6944

✅ Linear Kernel performed better.


##Answer 8

* **Definition**


The Naïve Bayes classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, used for classification tasks.
It predicts the class of a given data point based on the probability of it belonging to each class.

* **Example**

Suppose we want to predict whether an email is **Spam or Not Spam**, based on words like:

“Free,” “Offer,” “Buy,” etc.


Even though words may appear together, Naïve Bayes assumes each word contributes **independently** to the probability of the email being spam.
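The independence assumption can be shown with a small hand calculation. All numbers below (the priors and the per-word likelihoods) are invented for illustration and do not come from any real dataset; the sketch simply multiplies the prior by each word's likelihood and normalizes.

```python
# Hypothetical prior probabilities of each class
prior = {"spam": 0.4, "not_spam": 0.6}

# Hypothetical likelihoods P(word appears | class)
likelihood = {
    "spam": {"free": 0.5, "offer": 0.4},
    "not_spam": {"free": 0.05, "offer": 0.1},
}

email_words = ["free", "offer"]

# Naive assumption: multiply the per-word likelihoods as if words were independent
scores = {}
for label in prior:
    score = prior[label]
    for word in email_words:
        score *= likelihood[label][word]
    scores[label] = score

# Normalize so the two posteriors sum to 1
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

predicted = max(posterior, key=posterior.get)
print(posterior)
print("Predicted class:", predicted)
```

With these made-up numbers the email is classified as spam, because both words are far more likely under the spam class than under the not-spam class.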

* **Why Important**

1. Simple and fast to train

2. Works well with high-dimensional data (like text)

3. Requires relatively little training data

4. Often performs well even though the “naïve” independence assumption rarely holds exactly

##Answer 9

Here are the key differences between the three variants, point by point:


1. **Gaussian Naïve Bayes**

* Assumes features follow a normal (bell-shaped) distribution.

* Suitable for continuous numerical data.

* Each feature is modeled using a mean (μ) and standard deviation (σ) per class.


2. **Multinomial Naïve Bayes**

* Assumes features represent counts or frequencies.

* Suitable for discrete, non-negative integer features, such as word counts in text.

* Often used for document classification.


3. **Bernoulli Naïve Bayes**

* Assumes binary features (values are 0 or 1).

* Suitable when we only care about whether a feature is present or absent, not how many times it occurs.

* Models each feature with a Bernoulli distribution.
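The following sketch pairs each variant with the kind of data it is meant for, using scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The tiny arrays are made-up toy data, chosen only so that each feature type is easy to see.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> Gaussian Naive Bayes
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.9, 5.0], [4.1, 5.2]])

# Word counts -> Multinomial Naive Bayes
X_counts = np.array([[3, 0], [2, 0], [0, 4], [0, 3]])

# Word present (1) or absent (0) -> Bernoulli Naive Bayes
X_binary = (X_counts > 0).astype(int)

gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.1, 2.0]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[0, 1]])

print("GaussianNB prediction:   ", gaussian_pred[0])
print("MultinomialNB prediction:", multinomial_pred[0])
print("BernoulliNB prediction:  ", bernoulli_pred[0])
```

Note that the binary matrix is derived from the count matrix by discarding how often each word occurs, which mirrors the difference between the Multinomial and Bernoulli variants.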

##Answer 10

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data        # Features
y = data.target      # Labels (0 = malignant, 1 = benign)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Display results
print("Gaussian Naïve Bayes on Breast Cancer Dataset")
print("---------------------------------------------")
print(f"Accuracy: {accuracy:.4f}\n")

# Optional: Detailed performance report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Gaussian Naïve Bayes on Breast Cancer Dataset
---------------------------------------------
Accuracy: 0.9386

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        42
      benign       0.95      0.96      0.95        72

    accuracy                           0.94       114
   macro avg       0.94      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114

