## Class 8 - Part 1
### Classification
This practical lesson focuses on the application of classification algorithms using the Iris dataset in Python with the *scikit-learn* library. The primary goal is to teach students fundamental machine learning concepts and techniques through hands-on experience in training, testing, and evaluating different classifiers.

Three classification models are introduced and compared:
1. **K-Nearest Neighbors (KNN)**: This model, configured with three neighbors, serves as an example of instance-based learning, where predictions are based on the nearest training examples in the feature space.
2. **Support Vector Machine (SVM)**: Using a linear kernel, this model demonstrates the concept of maximizing the margin between different classes, which is key to enhancing model generalization.
3. **Naive Bayes**: This classifier introduces probabilistic modeling, particularly Gaussian Naive Bayes, which assumes that features are conditionally independent given the class and that each feature is normally distributed within a class.

Each classifier is trained on the training data and then makes predictions on the test set. The accuracy of these predictions (the fraction of test samples classified correctly) is calculated and compared, giving a practical first impression of how well each method suits this dataset.
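
As a small illustration of the accuracy metric mentioned above, the following sketch computes it by hand and checks it against `accuracy_score`. The label arrays `y_true` and `y_pred` are made-up toy values, not results from the Iris data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical values chosen only for illustration)
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 1, 2, 1, 0])

# Accuracy is the fraction of positions where prediction and truth agree
manual_accuracy = float(np.mean(y_true == y_pred))
library_accuracy = accuracy_score(y_true, y_pred)

print(f"Manual accuracy:  {manual_accuracy:.4f}")
print(f"Library accuracy: {library_accuracy:.4f}")
```

Five of the six toy predictions match, so both values come out as 5/6.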

In [0]:
# Import necessary libraries for data handling, machine learning model building, preprocessing, evaluation, and visualization.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Import the loader function for the Iris dataset from the scikit-learn datasets module
from sklearn.datasets import load_iris

In [0]:
# Load the Iris dataset into the variable 'data'
data = load_iris()
# Create a DataFrame from the Iris dataset, setting the columns to the feature names provided in the dataset
iris = pd.DataFrame(data=data.data, columns=data.feature_names)
# Add a new column 'target' to the Iris DataFrame, containing the target labels from the dataset
iris['target'] = data.target

In [0]:
# Show the dimensions of the Iris DataFrame as (rows, columns)
iris.shape

In [0]:
# Display the first five rows of the Iris DataFrame to get an initial overview of the data structure and values.
iris.head()

### Data Description
The Iris dataset is one of the most well-known datasets in the field of machine learning. It was introduced by British statistician and biologist Ronald Fisher in 1936 as an example of linear discriminant analysis. This dataset is often used for classification tasks and testing machine learning algorithms.

The Iris dataset consists of 150 instances of iris flowers, each with four attributes: sepal length, sepal width, petal length, and petal width. All measurements are in centimeters. These data are used to classify the instances into one of three species or classes of iris, which are:
 - Iris-setosa (0)
 - Iris-versicolor (1)
 - Iris-virginica (2)
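
The integer codes listed above can be mapped back to species names with the `target_names` attribute of the loaded dataset. The sketch below is one possible way to do this; the `species` column and the `code_to_name` dictionary are names introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
iris = pd.DataFrame(data=data.data, columns=data.feature_names)
iris['target'] = data.target

# Build a lookup from integer code to species name (hypothetical helper dict)
code_to_name = {code: str(name) for code, name in enumerate(data.target_names)}

# Add a readable 'species' column next to the numeric 'target' column
iris['species'] = iris['target'].map(code_to_name)

# Count how many instances belong to each species
species_counts = iris['species'].value_counts()
print(code_to_name)
print(species_counts)
```

The counts confirm that the dataset is balanced, with 50 instances per species.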

![Iris](https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg)

In [0]:
# Create a pair plot of the Iris dataset features, differentiating the data points by class
# 'hue' is set to 'target' to color data points based on the Iris species
# 'vars' specifies the dataset features to include in the plot
# 'markers' assigns different markers to each class for better visual distinction
# 'palette' selects seaborn's built-in 'bright' color palette for clearer distinction among the classes
sns.pairplot(iris, hue='target', vars=data.feature_names, markers=["o", "s", "D"], palette='bright')
# Add a title above the pair plot with a specific font size and adjusted position
plt.suptitle('Pair Plot of Iris Dataset Features by Class', size=16, y=1.02)
# Display the plot
plt.show()

In [0]:
# Optional experiment: uncomment the next line to drop the 'petal width (cm)' feature
#iris = iris.drop('petal width (cm)', axis=1)

In [0]:
# Split the dataset into training and testing sets. 
# Features (X) are obtained by dropping the 'target' column, and targets (y) are the 'target' column.
# 30% of the data is reserved for testing, and the split is reproducible with a random state set to 22.
X_train, X_test, y_train, y_test = train_test_split(iris.drop('target', axis=1), iris['target'], test_size=0.3, random_state=22)
print("Dimensions of X_train:", X_train.shape)
print("Dimensions of X_test:", X_test.shape)
print("Dimensions of y_train:", y_train.shape)
print("Dimensions of y_test:", y_test.shape)
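
A purely random split does not guarantee that each class is equally represented in the test set. The sketch below, which reloads the data as NumPy arrays for self-containment, counts the classes in the test set for the same `test_size` and `random_state`, and compares them with a stratified split. Using `stratify=y` is an alternative shown for illustration, not what the lesson's split does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain random split with the same settings as in the lesson
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.3, random_state=22)
plain_counts = np.bincount(y_test_plain, minlength=3)

# Stratified split: class proportions are preserved in both subsets
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.3, random_state=22, stratify=y
)
strat_counts = np.bincount(y_test_strat, minlength=3)

print("Test-set class counts (plain):     ", plain_counts)
print("Test-set class counts (stratified):", strat_counts)
```

With stratification, each of the three classes contributes the same number of samples to the 45-sample test set.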

In [0]:
# Initialize the k-Nearest Neighbors classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Make predictions on the test data
knn_predictions = knn.predict(X_test)
# Calculate and print the accuracy of the model on the test data
knn_accuracy = accuracy_score(y_test, knn_predictions)
print(f"KNN Accuracy: {knn_accuracy:.2f}")
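
To make the idea of instance-based learning concrete, here is a minimal hand-written 3-nearest-neighbors sketch using Euclidean distance and a majority vote. The helper `knn_predict_one` is a hypothetical name introduced here; it is a simplified illustration and may break ties differently from `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

def knn_predict_one(x, X_ref, y_ref, k=3, n_classes=3):
    """Predict the class of one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from x to every reference (training) sample
    distances = np.linalg.norm(X_ref - x, axis=1)
    # Indices of the k closest reference samples
    nearest = np.argsort(distances)[:k]
    # Count the votes per class; argmax picks the lowest class index on a tie
    votes = np.bincount(y_ref[nearest], minlength=n_classes)
    return int(np.argmax(votes))

manual_predictions = np.array(
    [knn_predict_one(x, X_train, y_train) for x in X_test]
)
manual_accuracy = float(np.mean(manual_predictions == y_test))
print(f"Hand-written 3-NN accuracy: {manual_accuracy:.2f}")
```

Note that no model parameters are learned: the "training" step only stores the data, and all the work happens at prediction time.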

In [0]:
# Initialize the Support Vector Machine classifier with a linear kernel
svm = SVC(kernel='linear')
# Fit the classifier to the training data
svm.fit(X_train, y_train)
# Make predictions on the test data
svm_predictions = svm.predict(X_test)
# Calculate and print the accuracy of the model on the test data
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f"SVM Accuracy: {svm_accuracy:.2f}")
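
A fitted linear SVM can be inspected to see which training samples define the margins. The following sketch refits the model on NumPy arrays (an assumption made to keep the example self-contained) and prints the number of support vectors per class and the shape of the learned weight matrix.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Only the support vectors (samples on or inside the margin) determine the boundary
print("Support vectors per class:", svm.n_support_)
print("Total support vectors:", svm.support_vectors_.shape[0], "of", len(X_train))

# With three classes, SVC trains one binary classifier per pair of classes,
# so a linear kernel yields three weight vectors over the four features
print("Shape of coef_:", svm.coef_.shape)
```

Typically only a subset of the training samples end up as support vectors; the remaining samples could be removed without changing the decision boundary.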

In [0]:
# Initialize the Naive Bayes classifier using the Gaussian distribution
nb = GaussianNB()
# Fit the classifier to the training data
nb.fit(X_train, y_train)
# Make predictions on the test data
nb_predictions = nb.predict(X_test)
# Calculate and print the accuracy of the model on the test data
nb_accuracy = accuracy_score(y_test, nb_predictions)
print(f"Naive Bayes Accuracy: {nb_accuracy:.2f}")
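
Because Naive Bayes is a probabilistic model, it can report class probabilities rather than only hard labels. The sketch below refits `GaussianNB` on NumPy arrays (for self-containment) and prints the per-class feature means it estimated, along with the predicted probabilities for the first five test samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=22
)

nb = GaussianNB()
nb.fit(X_train, y_train)

# theta_ holds the estimated mean of each feature within each class
print("Per-class feature means (classes x features):")
print(np.round(nb.theta_, 2))

# predict_proba returns one probability per class for each sample
probabilities = nb.predict_proba(X_test[:5])
print("Class probabilities for the first five test samples:")
print(np.round(probabilities, 3))
```

Each row of probabilities sums to 1, and the predicted label is the class with the highest probability.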

In [0]:
# Compare the accuracy of the three classifiers
print("\nAccuracy Summary:")
print(f"KNN: {knn_accuracy:.2f}, SVM: {svm_accuracy:.2f}, Naive Bayes: {nb_accuracy:.2f}")

In [0]:
# Select the best model based on test accuracy (ties are resolved in the order KNN, SVM, Naive Bayes)
best_accuracy = max(knn_accuracy, svm_accuracy, nb_accuracy)
best_model = 'KNN' if best_accuracy == knn_accuracy else 'SVM' if best_accuracy == svm_accuracy else 'Naive Bayes'
print(f"Best performing model: {best_model} with accuracy of {best_accuracy:.2f}")
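
Accuracy on a single 45-sample test set can shift noticeably with the random split, so the ranking above should be read with caution. One common way to get a more stable comparison is k-fold cross-validation. The sketch below uses 5 folds, which is a choice made here for illustration and not part of the original lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Same three model configurations as in the lesson
models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'SVM': SVC(kernel='linear'),
    'Naive Bayes': GaussianNB(),
}

# Evaluate each model on 5 different train/test partitions of the full dataset
cv_results = {
    name: cross_val_score(model, X, y, cv=5) for name, model in models.items()
}

for name, scores in cv_results.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the mean accuracies differ by less than their standard deviations, the data do not clearly favor one model over the others.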

---------------------------------------------------------------------------------------------------------------------------

Author: <b>Julio Iglesias</b>