**Q1** (Gaussian Naïve Bayes Classifier) Implement a Gaussian Naïve Bayes classifier on the Iris dataset from sklearn.datasets using (i) a step-by-step implementation and (ii) the built-in function.

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load and Split Data
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Training: Calculate the per-feature mean and variance for each class
classes = np.unique(y_train)
mean_var = {}

for c in classes:
    # Filter data for the current class
    X_c = X_train[y_train == c]
    # Store mean and variance for each feature (column)
    mean_var[c] = {
        'mean': X_c.mean(axis=0),
        'var': X_c.var(axis=0)
    }

# 3. Helper function: Gaussian Probability Density
def calculate_probability(x, mean, var):
    exponent = np.exp(-((x - mean) ** 2) / (2 * var))
    return (1 / np.sqrt(2 * np.pi * var)) * exponent

# 4. Prediction Step
y_pred = []
for x in X_test:
    posteriors = []
    for c in classes:
        # Get stored stats for class c
        mean = mean_var[c]['mean']
        var = mean_var[c]['var']

        # Log prior, log P(Class), estimated from class frequencies in the training set
        prior = np.log(len(X_train[y_train == c]) / len(X_train))

        # Log-likelihood, log P(Data|Class), under the feature-independence assumption:
        # summing per-feature log densities avoids the underflow of multiplying small probabilities
        likelihood = np.sum(np.log(calculate_probability(x, mean, var)))

        # Log posterior (up to a constant shared by all classes) = log prior + log-likelihood
        posteriors.append(prior + likelihood)

    # Select class with highest posterior probability
    y_pred.append(classes[np.argmax(posteriors)])

# 5. Evaluation
print("Step-by-Step Implementation Accuracy:", accuracy_score(y_test, y_pred))

Step-by-Step Implementation Accuracy: 1.0
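
The per-sample loop above can also be written with NumPy broadcasting. The following is a minimal sketch, not part of the original solution: it assumes the same split (`test_size=0.2`, `random_state=42`) and evaluates the Gaussian log-density directly instead of taking the log of the density, which sidesteps underflow in `np.exp`. The names `means`, `variances`, `log_priors`, and `joint_log_likelihood` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

classes = np.unique(y_train)

# Per-class statistics, each of shape (n_classes, n_features)
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = np.array([X_train[y_train == c].var(axis=0) for c in classes])

# Log priors from class frequencies, shape (n_classes,)
log_priors = np.log(np.array([np.mean(y_train == c) for c in classes]))


def joint_log_likelihood(X):
    """Return log P(class) + sum_j log N(x_j | mean, var) for every sample and class."""
    # (n_samples, 1, n_features) - (n_classes, n_features) -> (n_samples, n_classes, n_features)
    diff = X[:, np.newaxis, :] - means
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    return log_priors + log_pdf.sum(axis=2)


# Pick the class with the largest log posterior for each test sample
y_pred_vec = classes[np.argmax(joint_log_likelihood(X_test), axis=1)]
print("Vectorized accuracy:", np.mean(y_pred_vec == y_test))
```

Since the arithmetic is the same as in the loop, the predictions should agree with the step-by-step version; only the way the computation is organized differs.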


In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Load Data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train Model
model = GaussianNB()
model.fit(X_train, y_train)

# 4. Predict and Evaluate
y_pred = model.predict(X_test)
print("In-built Implementation Accuracy:", accuracy_score(y_test, y_pred))

In-built Implementation Accuracy: 1.0
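
As a sanity check, the parameters learned by `GaussianNB` can be compared with the hand-computed class means, variances, and priors. This is a sketch under some assumptions: the fitted attribute is called `var_` in scikit-learn 1.0 and later (older releases used `sigma_`, hence the fallback), and the stored variance includes a small smoothing term `epsilon_` that is subtracted here before comparing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GaussianNB()
model.fit(X_train, y_train)

# Hand-computed statistics, one row per class
manual_means = np.array([X_train[y_train == c].mean(axis=0) for c in model.classes_])
manual_vars = np.array([X_train[y_train == c].var(axis=0) for c in model.classes_])
manual_priors = np.array([np.mean(y_train == c) for c in model.classes_])

# Assumption: `var_` on scikit-learn >= 1.0, `sigma_` on older versions
fitted_var = model.var_ if hasattr(model, "var_") else model.sigma_

means_match = np.allclose(model.theta_, manual_means)
# The fitted variance is the biased variance plus the smoothing term epsilon_
vars_match = np.allclose(fitted_var - model.epsilon_, manual_vars)
priors_match = np.allclose(model.class_prior_, manual_priors)

print("Means match:", means_match)
print("Variances match (after removing smoothing):", vars_match)
print("Priors match:", priors_match)
```

The smoothing term is tiny relative to the Iris feature variances, which is consistent with both implementations reporting the same accuracy on this split.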


**Q2** Explore the GridSearchCV tool in scikit-learn. This tool is
often used for tuning the hyperparameters of machine learning models. Use
it to find the best value of K for a K-NN classifier on any dataset.

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 1. Load Data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split Data (optional for GridSearchCV, but good practice to keep a hold-out set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Setup GridSearchCV
# Define the model
knn = KNeighborsClassifier()

# Define the parameter grid: Test K from 1 to 30
param_grid = {'n_neighbors': list(range(1, 31))}

# Initialize GridSearch
# cv=5 means 5-fold Cross-Validation
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=5, scoring='accuracy')

# 4. Run the search on training data
grid_search.fit(X_train, y_train)

# 5. Results
print("Best Value for K:", grid_search.best_params_['n_neighbors'])
print("Best Cross-Validation Score:", grid_search.best_score_)

# Optional: evaluate the best model (refit on the full training set) on the held-out test set
best_model = grid_search.best_estimator_
print("Test Set Accuracy with Best K:", best_model.score(X_test, y_test))

Best Value for K: 3
Best Cross-Validation Score: 0.9583333333333334
Test Set Accuracy with Best K: 1.0
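
Beyond the single best K, it can be useful to see how the cross-validated accuracy varies across the grid, since several values of K often score almost identically. The sketch below repeats the same search and reads the per-candidate scores from `cv_results_`; the variable names `k_values`, `mean_scores`, and `std_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 31))},
    cv=5,
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)

# cv_results_ holds one entry per candidate in the parameter grid
results = grid_search.cv_results_
k_values = np.array([params['n_neighbors'] for params in results['params']])
mean_scores = results['mean_test_score']  # mean accuracy over the 5 folds
std_scores = results['std_test_score']    # spread of accuracy over the 5 folds

for k, mean, std in zip(k_values, mean_scores, std_scores):
    print(f"K={k:2d}  mean CV accuracy={mean:.4f}  std={std:.4f}")
```

When several candidates are within one standard deviation of the best score, the choice of K is not strongly determined by this data, and the reported best value may change with a different split.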
