more reading: https://www.cs.upc.edu/~mmartin/DM3%20-%20NB%20i%20KNN.pdf

## The Importance of Choosing the Right Classification Algorithm
Choosing the right classification algorithm is crucial as each algorithm has its strengths and weaknesses, and no single algorithm works best for every problem. The choice depends on various factors including the size, quality, and nature of data, the urgency of the task, and what you want to do with the information.

For instance, if interpretability is a priority, you might choose Logistic Regression. If your data has many features, Naive Bayes might be a good option due to its feature independence assumption. If you need a quick and easy solution without much tuning, K-Nearest Neighbors could be the way to go.

## Iris Dataset
The dataset consists of 150 samples of iris flowers, each belonging to one of three species: Setosa, Versicolor, or Virginica. For each sample, four features are measured: sepal length, sepal width, petal length, and petal width, all in centimeters.

Features:

- Sepal Length: Length of the sepals (the outer parts of the flower that protect it while in bud), in centimeters.
- Sepal Width: Width of the sepals, in centimeters.
- Petal Length: Length of the petals (the inner, typically colorful parts of the flower), in centimeters.
- Petal Width: Width of the petals, in centimeters.

Classes (Target Variable):

- Setosa (class 0): Iris setosa, characterized by its smaller size; its petals are noticeably shorter and narrower than those of the other two species.
- Versicolor (class 1): Iris versicolor, larger than setosa, with measurements that generally fall between those of setosa and virginica.
- Virginica (class 2): Iris virginica, typically the largest of the three species.

## K-Nearest Neighbors (KNN)
One of the simplest, yet highly effective classification algorithms is the K-Nearest Neighbors (KNN) algorithm. KNN belongs to the family of instance-based, competitive learning, and lazy learning algorithms.

**Instance-based** means that KNN does not create a model from the training data but instead uses the training instances (or observations) themselves in the classification or prediction process.

The term **competitive learning** refers to the fact that for a new, unseen instance, an ‘election’ among candidate training instances takes place. Those candidates compete to ‘claim’ the unseen instance as part of their class.

KNN is also a non-parametric model: it makes no explicit assumptions about the functional form of the underlying data distribution. Instead, it relies on the data itself to determine the model complexity.

KNN is described as a lazy learning algorithm because it does not ‘learn’ from the training data during the training phase. Unlike most other machine learning algorithms, which construct a generalization model during the training phase, KNN does virtually no computation in the training phase. The real computational work of KNN happens during the testing phase when classifications are made for unseen instances.

The ‘K’ in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process.

**How it works**

The core idea behind KNN is the concept of ‘distance’ in the feature space. The algorithm calculates the distance between the new observation (the point we want to classify) and all the existing data points. The ‘K’ in KNN represents the number of nearest neighbors the algorithm considers when it classifies a new observation. If K=1, the algorithm assigns the class of the nearest neighbor to the new observation. If K=3, it considers the three nearest neighbors, and the new observation is assigned to the class that has the majority among these three neighbors. The selection of the ‘K’ value is crucial and usually chosen through cross-validation.

- Calculate Distances: KNN first calculates the distance between the new observation and every other observation in the training set. 
- Find Nearest Neighbors: The algorithm then sorts these calculated distances in ascending order and selects the ‘K’ instances (neighbors) closest to the new observation.
- Classify New Observation: Finally, the algorithm assigns the new observation to the class that has the majority among the K neighbors.
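
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: sort by distance (ascending) and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features, two well-separated classes.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # near the "a" cluster
print(knn_predict(train_points, train_labels, (3.9, 4.1), k=3))  # near the "b" cluster
```

Sorting every distance costs O(n log n) per query, which is fine for a sketch; library implementations typically use partial sorts or tree-based indexes instead.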

The ‘K’ in KNN is a hyperparameter that you choose as the data scientist. It determines the number of neighbors to consider when making the classification. If K is too small, the model might be overly sensitive to noise in the data; if K is too large, the vote can be dominated by points from other classes, which blurs the decision boundary.
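
One way to pick K is to compare cross-validated accuracy across a few candidate values. The sketch below assumes 5-fold cross-validation on the Iris data and an arbitrary list of candidate values; both are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate values of K: an arbitrary selection for illustration.
candidate_ks = [1, 3, 5, 7, 9, 11, 15]
mean_scores = {}
for k in candidate_ks:
    # Scaling sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:2d}  mean accuracy={score:.3f}")
print("Best K by 5-fold cross-validation:", best_k)
```

On a dataset this small the differences between neighboring values of K are often within noise, so the "best" K should be read as a rough guide.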

```python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report


# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the classifier to the data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Output confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

```text
[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
```



### Interpretation

**Confusion Matrix:**

Each row of the confusion matrix represents the true class, while each column represents the predicted class.

- (0,0): The classifier correctly predicted the 11 instances that belong to class 0.
- (1,1): The classifier correctly predicted the 13 instances that belong to class 1.
- (2,2): The classifier correctly predicted the 6 instances that belong to class 2.

In this case, all off-diagonal elements of the confusion matrix are zero, indicating that the model correctly predicted every instance of each class. There are no false positives or false negatives.

**Precision, Recall, F1-score, and Support:**

Precision measures the proportion of instances predicted as positive that are actually positive. A score of 1.00 indicates perfect precision for all classes, meaning there were no false positives.

Recall (also known as sensitivity) measures the proportion of actual positives that were correctly predicted as positive. Again, a score of 1.00 indicates perfect recall for all classes, meaning there were no false negatives.

F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. A score of 1.00 is only possible when both precision and recall are perfect, which is the case for all classes here.

Support indicates the number of instances for each class in the test dataset.

**Accuracy:**

The overall accuracy of the model is 100%, meaning that all instances in the test dataset were correctly classified.

In summary, the KNN model achieved perfect classification performance across all metrics: it correctly classified every instance in the test dataset. Because the test set contains only 30 samples, this is encouraging evidence that the model generalizes to unseen data on this task rather than proof of it; a different split could yield a few misclassifications.

### Advantages and Disadvantages of KNN

**Strengths of KNN:**

Simplicity: KNN is conceptually straightforward and easy to understand. The algorithm’s logic — classifying an instance based on its similarity to other instances — is intuitive.

No Training Phase: Since KNN is a lazy learning algorithm, it doesn’t learn a model. This makes the training phase very fast (all it needs to do is store the dataset).

Adaptability: KNN is a non-parametric algorithm, which means it makes no explicit assumptions about the shape of the function mapping inputs to outputs. This makes KNN adaptable and able to model complex decision boundaries.

**Limitations of KNN:**

Computational Intensity: As a lazy learning algorithm, KNN does all its computation at prediction time. This can be very computationally intensive, especially with large datasets.

Sensitive to Irrelevant Features: KNN treats all features equally, which can be a problem if some features are irrelevant. Irrelevant or redundant features can negatively impact the performance of KNN.

Choice of K and Distance Metric: The choice of the number of neighbors (K) and the distance metric are critical and can significantly affect the performance of KNN. These parameters typically need to be determined through cross-validation, which can be computationally expensive.

Performance with Imbalanced Data: KNN can perform poorly with imbalanced data. If one class has significantly more instances than another, KNN is likely to classify new instances based on the majority class, irrespective of the feature values.

## Naive Bayes

Bayes’ Theorem provides a way to calculate the probability of a data point belonging to a particular class, given our prior knowledge. In the context of classification, this can be thought of as the probability of a class (or category) given a set of features, which is the essence of a Naive Bayes classifier.

The term ‘naive’ comes from the algorithm’s underlying assumption of independence between every pair of features. This assumption is ‘naive’ because it’s seldom true in real-world data — features often influence each other. However, even with this naive assumption, the algorithm often performs well and can be particularly effective in large datasets.

The Naive Bayes algorithm is based on applying Bayes’ theorem, a formula describing how to update probabilities based on new data. In the context of classification, it calculates the conditional probability of a class C given a set of predictor variables X = (X1, X2, …, Xn).

The key simplification is that the effect of the value of one predictor on a given class C is treated as independent of the values of the other predictors. In other words, all features X1, X2, …, Xn are assumed to be mutually independent given the class C.

Under this assumption, the probability of C given X can be computed from the class prior and the per-feature probabilities, and the classifier predicts the class with the highest probability for the given set of features.
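
Written out in the notation used above (class C, features X1, …, Xn), Bayes’ theorem gives the posterior probability of the class:

$$
P(C \mid X_1, \dots, X_n) = \frac{P(C)\, P(X_1, \dots, X_n \mid C)}{P(X_1, \dots, X_n)}
$$

The independence assumption replaces the joint likelihood with a product of per-feature terms:

$$
P(X_1, \dots, X_n \mid C) = \prod_{i=1}^{n} P(X_i \mid C)
$$

Since the denominator does not depend on C, the predicted class is the one that maximizes the numerator:

$$
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(X_i \mid C)
$$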

### Assumptions Made in Naive Bayes

**Class Conditional Independence:** The algorithm assumes that predictors are independent of each other given the class. In other words, the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

**Equal Importance of Features:** Every feature is given the same weight or importance in the Naive Bayes algorithm. The algorithm doesn’t learn which features are more important in the classification.


### Variants of Naive Bayes classifiers

**Gaussian Naive Bayes (GaussianNB):**

Assumes that features follow a Gaussian distribution (normal distribution).
Suitable for continuous features.

**Multinomial Naive Bayes (MultinomialNB):**

Suitable for discrete features, such as word counts in text classification.
Commonly used in document classification tasks.

**Bernoulli Naive Bayes (BernoulliNB):**

Similar to Multinomial Naive Bayes but assumes that features are binary (e.g., presence or absence of a feature).
Often used in text classification with binary feature vectors.
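
As a rough illustration of how the three variants treat the same input, the sketch below fits each one on a tiny, hypothetical word-count matrix (two words, four documents). The data are made up for the example; only the scikit-learn classes themselves are real.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data: counts of two words in four documents.
X_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
y = np.array([0, 0, 1, 1])

# MultinomialNB models the counts themselves.
mnb = MultinomialNB().fit(X_counts, y)
# BernoulliNB binarizes each feature first (count > 0 becomes 1).
bnb = BernoulliNB(binarize=0.0).fit(X_counts, y)
# GaussianNB treats each feature as a continuous, normally distributed value.
gnb = GaussianNB().fit(X_counts.astype(float), y)

new_docs = np.array([[4, 0], [0, 1]])
for name, model in [("Multinomial", mnb), ("Bernoulli", bnb), ("Gaussian", gnb)]:
    print(name, model.predict(new_docs))
```

All three agree on data this clean; the differences show up when counts, presence/absence, or real-valued magnitudes carry different information.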


**Why use GaussianNB**

For the Iris dataset, which contains continuous numerical features, the most appropriate Naive Bayes classifier to use is the Gaussian Naive Bayes (GaussianNB).

Here's why GaussianNB is a suitable choice for the Iris dataset:

- Continuous Features: The Iris dataset contains continuous numerical features (sepal length, sepal width, petal length, and petal width). GaussianNB assumes that the features within each class are normally distributed, which is a reasonable approximation for these measurements and makes it well-suited for datasets with continuous features.
- Robustness to Irrelevant Features: Naive Bayes classifiers tend to be fairly robust to irrelevant features and can handle datasets with a large number of features. The Iris dataset has only four features, so this matters less here, but it means additional features could usually be added without a large impact on performance.
- Simplicity and Efficiency: GaussianNB is simple and computationally efficient, making it suitable for small to medium-sized datasets like Iris. It is also easy to implement and interpret, which makes it a good choice for introductory machine learning tasks.
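
To make the Gaussian assumption concrete, the following sketch reimplements the core of Gaussian Naive Bayes with NumPy: estimate a prior, mean, and variance per class and feature, then pick the class with the highest log posterior. It is a simplified illustration; unlike scikit-learn's `GaussianNB`, it applies no variance smoothing, and the helper name `log_posterior` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class prior, feature means, and feature variances estimated from the data.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])


def log_posterior(samples):
    """Unnormalized log P(C | x), shape (n_samples, n_classes)."""
    # Broadcast samples against class parameters: (n_samples, n_classes, n_features).
    diff = samples[:, None, :] - means[None, :, :]
    log_density = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    )
    # Independence assumption: per-feature log densities are simply summed.
    return np.log(priors)[None, :] + log_density.sum(axis=2)


predictions = classes[np.argmax(log_posterior(X), axis=1)]
train_accuracy = np.mean(predictions == y)
print("Training accuracy:", train_accuracy)
```

Working in log space avoids numerical underflow from multiplying many small probabilities.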

```python
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Fit the classifier to the data
gnb.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gnb.predict(X_test)

# Output confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

```text
[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
```



### Interpretation

**Confusion Matrix:**

Each row of the confusion matrix represents the true class, while each column represents the predicted class.

- (0,0): The classifier correctly predicted the 11 instances that belong to class 0.
- (1,1): The classifier correctly predicted 12 instances that belong to class 1.
- (1,2): The classifier incorrectly predicted 1 instance as class 2, which actually belongs to class 1.
- (2,2): The classifier correctly predicted the 6 instances that belong to class 2.

Apart from the single off-diagonal entry at (1,2), all off-diagonal elements of the confusion matrix are zero, so the model misclassified only 1 of the 30 test instances.

**Precision, Recall, F1-score, and Support:**

Precision measures the proportion of instances predicted as positive that are actually positive. A score of 1.00 indicates perfect precision for class 0 and class 1, meaning there were no false positives for these classes. For class 2, precision is lower at 0.86 (6 of the 7 instances predicted as class 2 were correct), because one class 1 instance was predicted as class 2.

Recall (also known as sensitivity) measures the proportion of actual positives that were correctly predicted as positive. Recall is perfect (1.00) for class 0 and class 2, meaning there were no false negatives for these classes. For class 1, recall is slightly lower at 0.92 (12 of 13), because one instance of class 1 was not correctly identified by the model.

F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. It is high for all classes, indicating good overall performance in terms of both precision and recall.

Support indicates the number of instances for each class in the test dataset.
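
The figures in the classification report can be reproduced directly from the confusion matrix printed above. The sketch below uses plain Python and the variable name `report` is just an illustrative choice.

```python
# Confusion matrix from the Naive Bayes run above:
# rows are true classes, columns are predicted classes.
cm = [
    [11, 0, 0],
    [0, 12, 1],
    [0, 0, 6],
]

n_classes = len(cm)
report = {}
for c in range(n_classes):
    true_positives = cm[c][c]
    predicted_as_c = sum(cm[row][c] for row in range(n_classes))  # column sum
    support = sum(cm[c])  # row sum: number of true instances of class c
    precision = true_positives / predicted_as_c
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    report[c] = (precision, recall, f1, support)
    print(f"class {c}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")

accuracy = sum(cm[c][c] for c in range(n_classes)) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

Precision divides by a column sum (everything predicted as the class), while recall divides by a row sum (everything that truly is the class), which is why the single error lowers precision for class 2 and recall for class 1.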


## Comparing the Performance of the Two Models

**KNN Model:**

Achieved perfect precision, recall, and F1-score for all classes.

Produced a confusion matrix where all off-diagonal elements are zero, indicating no misclassifications.

Attained an accuracy of 100%.

**Naive Bayes Model:**

Achieved slightly lower precision, recall, and F1-score for some classes compared to KNN.

Produced a confusion matrix with a single misclassification (one class 1 instance predicted as class 2).

Attained an accuracy of 97%.

While both models performed very well, the KNN model outperformed the Naive Bayes model on this test split, achieving perfect precision, recall, and F1-score. The gap amounts to a single test instance out of 30, so based on these metrics KNN performed best for this particular split, though by a narrow margin.

However, it's essential to consider other factors such as computational complexity, scalability, and interpretability when choosing the best model for deployment in real-world scenarios.
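
Because a single 30-sample split can favor either model by chance, a cross-validated comparison gives a steadier picture. The sketch below assumes 10 stratified folds with a fixed seed; both are arbitrary choices for illustration, and the dictionary keys are just display labels.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

models = {
    "KNN (K=3, scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean accuracy={scores.mean():.3f} (std {scores.std():.3f})")
```

If the mean accuracies differ by less than the fold-to-fold standard deviation, the two models are hard to distinguish on this dataset.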

### Advantages and Disadvantages of Naive Bayes

**Strengths of Naive Bayes:**

Efficiency: Naive Bayes classifiers are incredibly fast compared to more sophisticated methods. This is because they decouple the class conditional feature distributions, so you can independently estimate each feature’s distribution and then multiply them together to obtain the required result.

Simplicity: Naive Bayes classifiers are easy to implement and understand. They are a good choice if you want to build a baseline model to benchmark more complex models.

Performance: Despite their simplicity, Naive Bayes classifiers often perform surprisingly well and are widely used for text classification and spam filtering.

Handling Categorical Features: Naive Bayes handles categorical features well and is relatively robust to irrelevant features.

**Limitations of Naive Bayes:**

Independence Assumption: The most significant limitation of Naive Bayes is the assumption of feature independence. This is a strong assumption that rarely holds for real data; nevertheless, Naive Bayes classifiers often perform well on complex real-world problems, even when this assumption isn’t valid.

Zero Frequency: If a category of a categorical variable is not observed in the training set for a given class, the model assigns a zero probability to that category, which forces the whole product for that class to zero regardless of the other features. This is often known as the “Zero Frequency” problem. To solve it, we can use a smoothing technique such as Laplace (additive) smoothing, which adds a small count to every category so that none has zero probability.
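
A small sketch of additive smoothing, using a hypothetical "color" feature and a hypothetical helper `category_probabilities`: with `alpha=0` an unseen category gets probability zero, while `alpha=1` (Laplace smoothing) gives it a small non-zero share.

```python
from collections import Counter


def category_probabilities(observed, categories, alpha=1.0):
    """Estimate P(category | class) with additive smoothing strength `alpha`."""
    counts = Counter(observed)
    total = len(observed) + alpha * len(categories)
    return {cat: (counts[cat] + alpha) / total for cat in categories}


# Hypothetical training values of a "color" feature within one class.
colors_seen = ["red", "red", "green", "red"]
all_colors = ["red", "green", "yellow"]  # "yellow" never appears in training

unsmoothed = category_probabilities(colors_seen, all_colors, alpha=0.0)
smoothed = category_probabilities(colors_seen, all_colors, alpha=1.0)
print("alpha=0:", unsmoothed)
print("alpha=1:", smoothed)
```

scikit-learn's discrete Naive Bayes classifiers (for example `MultinomialNB` and `BernoulliNB`) expose the same idea through their `alpha` parameter.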

Continuous Features: While Naive Bayes handles categorical features well, it may not perform as well with continuous features. Gaussian Naive Bayes assumes that each continuous feature is normally distributed within each class, which often holds only approximately in real-world data.