# Supervised Classification: Decision Trees, SVM, and Naïve Bayes

**Note:** No personal data has been included.

## Q1: What is Information Gain, and how is it used in Decision Trees?

- **Information Gain (IG)** measures the reduction in entropy (uncertainty) about the target variable after splitting the data on a feature.
- It is the entropy of the parent node minus the weighted sum of the child-node entropies, where each child is weighted by its share of the parent's samples: $IG = H(\text{parent}) - \sum_k \frac{n_k}{n} H(\text{child}_k)$.
- At each node, a decision tree evaluates candidate splits and picks the feature and threshold with the largest IG (or a related criterion such as Gain Ratio), i.e., the greatest reduction in impurity.
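
The computation above can be sketched in plain Python. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions, not part of any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Toy example with made-up labels: 4 "yes" and 4 "no" in the parent node.
parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]            # each child is pure
useless_split = [["yes", "yes", "no", "no"]] * 2     # children mirror the parent

ig_perfect = information_gain(parent, perfect_split)
ig_useless = information_gain(parent, useless_split)
print("IG (perfect split):", ig_perfect)
print("IG (useless split):", ig_useless)
```

The perfect split recovers the full 1 bit of parent entropy, while the split that leaves the class mix unchanged gains nothing, so a tree would prefer the former.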


## Q2: What is the difference between Gini Impurity and Entropy?

- **Entropy** (from information theory) measures impurity as $-\sum p_i \log_2 p_i$, where $p_i$ is the fraction of samples in class $i$; it is 0 for a pure node, largest when all classes are equally likely, and is the quantity used by Information Gain.
- **Gini Impurity** is $1 - \sum p_i^2$: the probability that a randomly drawn sample is misclassified if its label is assigned at random according to the node's class distribution.
- **Differences & use-cases:** Gini is slightly faster to compute (no logarithm) and often yields trees similar to those built with Entropy; Entropy has a stronger information-theoretic interpretation. In practice both work well, and the choice is usually made empirically.
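
To make the two formulas concrete, here is a minimal sketch that evaluates both on a few two-class distributions. The function names and the example probabilities are assumptions chosen for illustration.

```python
from math import log2

def gini_impurity(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

def entropy_from_probs(probs):
    """Entropy in bits: -sum(p_i * log2(p_i)); zero-probability classes contribute 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

# Pure node, mostly pure node, and maximally mixed node.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs,
          "gini:", round(gini_impurity(probs), 4),
          "entropy:", round(entropy_from_probs(probs), 4))
```

Both measures are zero for a pure node and peak at the 50/50 mix (Gini at 0.5, entropy at 1 bit), which is why they tend to rank candidate splits similarly.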


## Q3: What is Pre-Pruning in Decision Trees?

- **Pre-pruning** (early stopping) halts tree growth during training by applying stopping criteria (max depth, min samples per leaf, min impurity decrease, etc.).
- Purpose: prevent overfitting by limiting complexity before the tree perfectly fits training data.
- Trade-off: may underfit if pruning is too aggressive; selection of pre-pruning hyperparameters typically uses validation data or cross-validation.
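
As a rough sketch of these stopping criteria in scikit-learn, the example below fits an unconstrained tree and a pre-pruned tree on the Wine dataset. The specific limits (`max_depth=3`, `min_samples_leaf=5`, `min_impurity_decrease=0.01`) are arbitrary illustrative values, not tuned recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Unconstrained tree: keeps splitting until leaves are pure or cannot be split.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops early once any of these limits is hit.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # illustrative value
    min_samples_leaf=5,          # illustrative value
    min_impurity_decrease=0.01,  # illustrative value
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, "| depth:", tree.get_depth(),
          "| leaves:", tree.get_n_leaves(),
          "| test accuracy:", tree.score(X_test, y_test))
```

In practice these limits would be chosen with cross-validation (for example via `GridSearchCV`) rather than fixed by hand.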


```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Wine dataset and hold out 25% as a stratified test set.
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Train a decision tree that uses Gini impurity as the split criterion.
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across all features.
print('Feature importances:')
for name, imp in zip(wine.feature_names, clf.feature_importances_):
    print(f'{name}: {imp:.6f}')
```

```
Feature importances:
alcohol: 0.000000
malic_acid: 0.019574
ash: 0.021419
alcalinity_of_ash: 0.022219
magnesium: 0.000000
total_phenols: 0.000000
flavanoids: 0.410802
nonflavanoid_phenols: 0.000000
proanthocyanins: 0.000000
color_intensity: 0.403317
hue: 0.000000
od280/od315_of_diluted_wines: 0.022351
proline: 0.100318
```


## Q5: What is a Support Vector Machine (SVM)?

- An SVM is a supervised learning model used for classification (and regression) that finds a hyperplane maximizing the margin between classes.
- The support vectors are the training samples closest to the decision boundary; they define the position of the boundary.
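
The geometry can be illustrated without a solver. The sketch below uses hypothetical 2-D points and a hand-picked hyperplane $w \cdot x + b = 0$ (an assumption for illustration, not the output of a fitted SVM) to compute the margin width $2 / \lVert w \rVert$ and identify the points closest to the boundary.

```python
from math import hypot

# Hypothetical data: (point, label). Values chosen so the arithmetic is easy to check.
points = [((1.0, 1.0), -1), ((0.0, 2.0), -1), ((1.0, 0.0), -1),
          ((3.0, 3.0), +1), ((4.0, 3.0), +1), ((4.0, 4.0), +1)]

# Hand-picked hyperplane, scaled so the closest points satisfy |w.x + b| = 1.
w, b = (0.5, 0.5), -2.0

def decision_value(x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b

norm_w = hypot(*w)
margin_width = 2.0 / norm_w  # distance between the two margin boundaries

# Geometric distance of every point to the hyperplane.
distances = [abs(decision_value(x)) / norm_w for x, _ in points]
closest = min(distances)

# Support vectors: the samples that sit exactly on the margin boundaries.
support_vectors = [x for (x, _), d in zip(points, distances)
                   if abs(d - closest) < 1e-9]

print("margin width:", round(margin_width, 3))
print("support vectors:", support_vectors)
```

Moving or deleting any of the other points (as long as they stay outside the margin) would leave this boundary unchanged, which is the sense in which the support vectors define it.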


## Q6: What is the Kernel Trick in SVM?

- The kernel trick allows SVMs to operate in high-dimensional (possibly infinite-dimensional) feature spaces without explicitly computing coordinates in that space.
- A kernel function takes two samples in the original input space and returns their dot product in the transformed feature space. Because the SVM only needs these dot products, it can learn non-linear decision boundaries (e.g., with RBF or polynomial kernels).
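
A small numeric check makes this concrete. For 2-D inputs, the homogeneous degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$ corresponds to the explicit map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The function names and sample vectors below are illustrative assumptions.

```python
from math import sqrt

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the input space."""
    return dot(x, z) ** 2

def explicit_feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel (2-D input only)."""
    x1, x2 = x
    return (x1 * x1, sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)

via_kernel = poly2_kernel(x, z)                                  # no mapping needed
via_map = dot(explicit_feature_map(x), explicit_feature_map(z))  # map first, then dot

print("kernel value:     ", via_kernel)
print("explicit-map value:", via_map)
```

The two values agree up to floating-point rounding. For kernels such as RBF the implicit feature space is infinite-dimensional, so only the kernel route is feasible.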


```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Wine dataset and hold out 25% as a stratified test set.
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Note: the features are not standardized here. The RBF kernel is sensitive
# to feature scale, which likely explains its lower accuracy in the output.
svc_lin = SVC(kernel='linear', random_state=42).fit(X_train, y_train)
svc_rbf = SVC(kernel='rbf', random_state=42).fit(X_train, y_train)

# Compare test-set accuracy of the two kernels.
acc_lin = accuracy_score(y_test, svc_lin.predict(X_test))
acc_rbf = accuracy_score(y_test, svc_rbf.predict(X_test))
print(f'Accuracy (Linear SVM): {acc_lin:.6f}')
print(f'Accuracy (RBF SVM): {acc_rbf:.6f}')
```

```
Accuracy (Linear SVM): 0.955556
Accuracy (RBF SVM): 0.711111
```


## Q8: What is the Naïve Bayes classifier, and why is it called 'Naïve'?

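- Naïve Bayes is a family of probabilistic classifiers that apply Bayes' theorem to compute the probability of each class given the features, then predict the most probable class.
- It is called 'naïve' because it assumes the features are conditionally independent given the class label, so the class likelihood factorizes into a product of per-feature terms. This assumption is rarely true, yet the classifier often works well in practice.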
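
The factorization can be worked through by hand. The sketch below uses made-up priors and word likelihoods for a toy spam filter (all numbers and names are assumptions for illustration) and multiplies only the likelihoods of the words that were observed.

```python
# Made-up class priors P(class).
priors = {"spam": 0.4, "ham": 0.6}

# Made-up P(word present | class); words are treated as independent given the class.
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(observed_words):
    """Bayes' theorem with the naive assumption: prior times a product of per-word terms."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in observed_words:
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant P(observed words)
    return {label: score / total for label, score in scores.items()}

result = posterior(["free"])
print(result)
```

With only "free" observed, the unnormalized scores are 0.4 × 0.8 = 0.32 for spam and 0.6 × 0.1 = 0.06 for ham, so spam is the predicted class.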


## Q9: Differences between Gaussian, Multinomial, and Bernoulli Naïve Bayes

- **GaussianNB:** Assumes features are continuous and normally distributed within each class (useful for real-valued features).
- **MultinomialNB:** Models feature counts (e.g., word counts) using a multinomial distribution; commonly used in text classification.
- **BernoulliNB:** Models binary/boolean features (presence/absence), suitable when features are binary indicators.
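
The three variants differ mainly in the per-class likelihood term they multiply into the posterior. The functions below are a simplified sketch of those terms with hypothetical parameters. They are not scikit-learn's implementation, which works in log space and applies smoothing.

```python
from math import exp, pi, sqrt

def gaussian_likelihood(x, mean, var):
    """Gaussian term: normal density of one real-valued feature."""
    return exp(-((x - mean) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)

def multinomial_likelihood(counts, theta):
    """Multinomial term: product of theta_i ** count_i.
    The multinomial coefficient is omitted because it is identical for every class."""
    result = 1.0
    for count, t in zip(counts, theta):
        result *= t ** count
    return result

def bernoulli_likelihood(bits, p):
    """Bernoulli term: p_i when feature i is present, (1 - p_i) when it is absent."""
    result = 1.0
    for bit, p_i in zip(bits, p):
        result *= p_i if bit else 1.0 - p_i
    return result

# Hypothetical per-class parameters, chosen only for illustration.
g = gaussian_likelihood(1.0, mean=1.0, var=1.0)
m = multinomial_likelihood([2, 0, 1], theta=[0.5, 0.3, 0.2])
b = bernoulli_likelihood([1, 0, 1], p=[0.5, 0.3, 0.2])
print(g, m, b)
```

Note how the Bernoulli term penalizes the absent second feature with a factor of 0.7, whereas the multinomial term ignores features whose count is zero.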


```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the Breast Cancer dataset and hold out 25% as a stratified test set.
bc = load_breast_cancer()
X, y = bc.data, bc.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit Gaussian Naive Bayes (all features are continuous) and score it.
gnb = GaussianNB().fit(X_train, y_train)
acc = accuracy_score(y_test, gnb.predict(X_test))
print(f'Accuracy (GaussianNB) on test data: {acc:.6f}')
```

```
Accuracy (GaussianNB) on test data: 0.937063
```

