# ========================================== THEORY ===================================

# Q1: Explain the working principle of KNN.
"""
**K-Nearest Neighbors (KNN)** is a lazy, instance-based algorithm that classifies a data point based on how its neighbors are classified.
It calculates the distance between the query point and every point in the training set, selects the 'k' nearest points, and assigns the most common class among them.
It can be used for both **classification** and **regression** (where it averages the neighbors' target values), and is especially useful when the decision boundary is non-linear.
"""

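The three steps described in Q1 (measure distances, keep the k nearest, take a majority vote) can be sketched in plain Python. This is a minimal illustration rather than how `KNeighborsClassifier` is implemented; the function name `knn_predict` and the toy points are assumptions made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point (nothing is fitted).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.2, 1.4), k=3))  # a
print(knn_predict(train_points, train_labels, (6.4, 6.9), k=3))  # b
```

Because all the work happens at prediction time, the cost of each query grows with the size of the training set.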

# Q2: Common Distance Metrics in KNN.
"""
- **Euclidean Distance**: Most common; calculates straight-line distance.  
- **Manhattan Distance**: Sum of absolute differences across dimensions.  
- **Minkowski Distance**: Generalized version of both Euclidean and Manhattan (with parameter `p`).  
- **Hamming Distance**: For categorical variables (e.g., strings or binary features).  
"""

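The metrics listed in Q2 can be written out in a few lines. This sketch uses only the standard library; the helper names are assumptions for illustration. Note that `hamming` here counts mismatched positions, whereas some libraries report the fraction of mismatches instead.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def euclidean(u, v):
    # Minkowski with p = 2: straight-line distance.
    return minkowski(u, v, 2)


def manhattan(u, v):
    # Minkowski with p = 1: sum of absolute differences.
    return minkowski(u, v, 1)


def hamming(u, v):
    # Number of positions at which the two sequences differ.
    return sum(a != b for a, b in zip(u, v))


print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7.0
print(hamming("karolin", "kathrin"))  # 3
```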

# Q3: Advantages and Limitations of KNN
"""
**Advantages**:
- Simple to understand and implement.
- No explicit training phase: the model simply stores the training data.
- Works well with small datasets.

**Limitations**:
- Computationally expensive for large datasets.
- Sensitive to irrelevant features and scaling.
- Struggles with imbalanced classes.
"""

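To make the scaling limitation from Q3 concrete, the sketch below finds the nearest neighbor of one query point before and after standardization. The two training points and the query are invented numbers, chosen so that the second feature has a much larger numeric range than the first.

```python
import math
from statistics import mean, pstdev

# Hypothetical data: feature 1 spans a few units, feature 2 spans about a thousand.
train = [(1.0, 1000.0), (5.0, 1010.0)]
query = (1.2, 1008.0)


def nearest_index(points, q):
    """Index of the point closest to q under Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], q))


# Per-feature mean and (population) standard deviation of the training data.
columns = list(zip(*train))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]


def standardize(point):
    """Rescale each feature to zero mean and unit variance using training stats."""
    return tuple((x - m) / s for x, m, s in zip(point, means, stds))


raw_nearest = nearest_index(train, query)
scaled_nearest = nearest_index([standardize(p) for p in train], standardize(query))

print(raw_nearest, scaled_nearest)  # 1 0
```

In raw units the large-range feature decides the neighbor; once both features are on the same scale, the small-range feature counts equally and the other point becomes the nearest.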

# Q4: Bayes' Theorem
"""
Bayes' Theorem:  
$$P(A \\mid B) = \\frac{P(B \\mid A) \\cdot P(A)}{P(B)}$$  
In Naive Bayes, it helps calculate the probability of a class given the input features, assuming independence between features.
"""

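A short worked example may help make the formula in Q4 concrete. The probabilities below are invented for illustration (they are not estimated from the spam dataset used later): A is "the message is spam" and B is "the message contains the word 'free'".

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
p_spam = 0.2              # P(A): prior probability that a message is spam
p_free_given_spam = 0.6   # P(B|A): "free" appears in a spam message
p_free_given_ham = 0.05   # P(B|not A): "free" appears in a legitimate message

# P(B) by the law of total probability.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_free, 2))             # 0.16
print(round(p_spam_given_free, 2))  # 0.75
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75 under these assumed numbers.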

# Q5: What does "Naive" mean?
"""
The term **"Naive"** refers to the assumption that all features are **independent** of each other given the class label.
This assumption simplifies computation and lets the model scale to many features, even though it rarely holds in practice.
"""

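Written out, the independence assumption from Q5 lets the joint likelihood factor into one term per feature. The symbols here are notation introduced for this note: $y$ is the class label and $x_1, \dots, x_n$ are the feature values.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Each factor $P(x_i \mid y)$ can be estimated separately, which is why the model needs only per-feature statistics for each class.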

# Q6: Compare Gaussian, Multinomial, Bernoulli
"""
- **GaussianNB**: Assumes features follow a normal distribution. Used for continuous data (e.g., Iris).
- **MultinomialNB**: Used for count data (e.g., word frequencies in text classification).
- **BernoulliNB**: Binary features (e.g., presence/absence of words).

**Use Cases**:
- Gaussian: Iris dataset (classification by petal/sepal features).
- Multinomial: Spam filtering, document classification.
- Bernoulli: Sentiment analysis (binary presence of certain words).
"""

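As a rough illustration of the comparison in Q6, the sketch below fits each variant on the kind of input it is designed for. The tiny arrays and the names `toy_y`, `X_continuous`, `X_counts`, and `X_binary` are assumptions invented for this example, not data from this notebook.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy labels: two samples per class.
toy_y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB.
X_continuous = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])
# Non-negative counts (e.g. word frequencies) -> MultinomialNB.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Presence/absence flags derived from the counts -> BernoulliNB.
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, features) in models.items():
    model.fit(features, toy_y)
    predictions[name] = model.predict(features)
    print(name, predictions[name])
```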

# Q7: Two Key Differences
"""
1. **Learning Type**:  
   - KNN is a lazy learner (no training)  
   - Naive Bayes is a probabilistic model (trained using statistics)

2. **Performance**:  
   - KNN struggles with high-dimensional data  
   - Naive Bayes handles high-dimensional, sparse data well (e.g., text)
"""


# ========================================== PRACTICAL PART A ===================================

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB



In [2]:
# Load and Split Iris Data
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [3]:
# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [4]:
# KNN Training & Evaluation (k=3,5,7)
for k in [3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred = knn.predict(X_test_scaled)
    
    print(f"\nKNN with k={k}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))



KNN with k=3
Accuracy: 1.0
Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30


KNN with k=5
Accuracy: 1.0
Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30


KNN with k=7
Accuracy

# ========================================== PRACTICAL PART B ===================================

In [7]:
# Load dataset and keep only needed columns
df = pd.read_csv("spam.csv", encoding="latin-1")[['Category', 'Message']]
df.columns = ['label', 'message']  # Rename for clarity

# Convert labels to binary
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...


In [9]:
# Preprocess Text
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

df['cleaned'] = df['message'].apply(clean_text)
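
To see what the cleaning step does, here is the same `clean_text` function applied to a single made-up message (the sample string is an assumption, not a row from `spam.csv`). Letters are lowercased, digits and whitespace are kept, and punctuation and symbols are dropped.

```python
import re


def clean_text(text):
    # Same logic as the cell above: lowercase, then strip everything
    # that is not a letter, a digit, or whitespace.
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text


sample = "Free entry!! Win $1000 NOW"
cleaned_sample = clean_text(sample)
print(cleaned_sample)  # free entry win 1000 now
```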


In [10]:
# Vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cleaned'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
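
The cell above fits the vectorizer on all messages before splitting, so the vocabulary and IDF weights are computed with the test messages included. One common alternative, sketched here on a handful of invented messages, is to put the vectorizer and classifier in a scikit-learn pipeline so the vectorizer only ever sees the training text. The variable names and sample messages are assumptions for illustration, and the metrics reported below in this notebook come from the original cell, not from this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training messages (1 = spam, 0 = ham).
train_messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we still on for lunch",
    "see you at the office tomorrow",
]
train_labels = [1, 1, 0, 0]

# The vectorizer is fitted inside the pipeline, on training text only.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_messages, train_labels)

# New messages are transformed with the training vocabulary, then classified.
new_messages = ["claim your free prize", "lunch at the office"]
new_predictions = spam_pipeline.predict(new_messages)
print(new_predictions)
```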


In [11]:
# MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_mnb = mnb.predict(X_test)

print("MultinomialNB Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_mnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_mnb))
print("Classification Report:\n", classification_report(y_test, y_pred_mnb))


MultinomialNB Performance:
Accuracy: 0.9551569506726457
Confusion Matrix:
 [[966   0]
 [ 50  99]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97       966
           1       1.00      0.66      0.80       149

    accuracy                           0.96      1115
   macro avg       0.98      0.83      0.89      1115
weighted avg       0.96      0.96      0.95      1115



In [12]:
# BernoulliNB
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
y_pred_bnb = bnb.predict(X_test)

print("BernoulliNB Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_bnb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_bnb))
print("Classification Report:\n", classification_report(y_test, y_pred_bnb))


BernoulliNB Performance:
Accuracy: 0.979372197309417
Confusion Matrix:
 [[963   3]
 [ 20 129]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.98      0.87      0.92       149

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



In [15]:
# Interpret Errors
cm = confusion_matrix(y_test, y_pred_mnb)
tn, fp, fn, tp = cm.ravel()
display(f"""
- **False Positives (Ham → Spam)**: {fp} → Legit messages marked spam.  
- **False Negatives (Spam → Ham)**: {fn} → Spam messages passed as legit.  

**Ideal Goal**: Minimize false negatives for better spam filtering.
""")


'\n- **False Positives (Ham → Spam)**: 0 → Legit messages marked spam.  \n- **False Negatives (Spam → Ham)**: 50 → Spam messages passed as legit.  \n\n**Ideal Goal**: Minimize false negatives for better spam filtering.\n'
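
The four counts from the MultinomialNB confusion matrix are enough to recompute the headline metrics by hand. The sketch below plugs in the values reported above (966, 0, 50, 99) and should agree with the classification report for the spam class.

```python
# Counts taken from the MultinomialNB confusion matrix above.
tn, fp, fn, tp = 966, 0, 50, 99

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")   # 0.9552
print(f"precision: {precision:.2f}")  # 1.00
print(f"recall:    {recall:.2f}")     # 0.66
print(f"f1:        {f1:.2f}")         # 0.80
```

The 50 false negatives are what pull recall down to 0.66 even though no legitimate message was flagged.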

In [17]:
# Reload and re-split the Iris dataset, since X, y, and the train/test variables now hold the spam data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load Iris again
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train_scaled, y_train)
y_pred_gnb = gnb.predict(X_test_scaled)

# Evaluation
print("GaussianNB on Iris:")
print("Accuracy:", accuracy_score(y_test, y_pred_gnb))
print("Classification Report:\n", classification_report(y_test, y_pred_gnb))


GaussianNB on Iris:
Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [18]:
# Cross-validation
from sklearn.model_selection import cross_val_score

print("KNN Cross-validation (k=5):", cross_val_score(KNeighborsClassifier(5), X, y, cv=5).mean())
print("GaussianNB Cross-validation:", cross_val_score(GaussianNB(), X, y, cv=5).mean())


KNN Cross-validation (k=5): 0.9733333333333334
GaussianNB Cross-validation: 0.9533333333333334
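
The cross-validation above runs KNN on the unscaled features. Since KNN is sensitive to feature scaling, one option is to wrap the scaler and classifier in a pipeline so that scaling is re-fitted on the training portion of each fold. This is a sketch of that variant, with `iris_X`, `iris_y`, and `scaled_knn` as names chosen for illustration; its score is not reported here and may differ from the figure above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris_X, iris_y = load_iris(return_X_y=True)

# The scaler is fitted on each fold's training split only, then applied
# to that fold's validation split before KNN predicts.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(scaled_knn, iris_X, iris_y, cv=5)

print("Scaled KNN cross-validation (k=5):", scores.mean())
```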


# =========================================== FINAL SUMMARY ===================================
"""
## ✅ Final Observations:

- **KNN** achieved high accuracy on Iris (1.0 on the held-out split for k=3 and k=5; about 0.97 in 5-fold cross-validation with k=5), but it is sensitive to feature scaling and not ideal for high-dimensional data.
- **MultinomialNB** reached about 0.955 accuracy on the spam data with no false positives, but it missed 50 of 149 spam messages (recall 0.66).
- **BernoulliNB** was more accurate on the same split (about 0.979) and recovered more spam (recall 0.87), at the cost of 3 false positives.
- **GaussianNB** performed well on Iris, comparable to KNN (1.0 on the held-out split; about 0.95 in 5-fold cross-validation).

### 🔍 Conclusion:
Choose **KNN** for structured, numeric data; **Naive Bayes** for text classification or when speed is important.
"""
