## Class 8 - Part 2
### Classification
This practical lesson focuses on the application of classification algorithms using the Red-Wine-Quality dataset in Python with the *scikit-learn* library. The primary goal is to teach students fundamental machine learning concepts and techniques through hands-on experience in training, testing, and evaluating different classifiers.

Three classification models are introduced and compared:
1. **K-Nearest Neighbors (KNN)**: This model, configured with three neighbors, serves as an example of instance-based learning, where predictions are based on the nearest training examples in the feature space.
2. **Support Vector Machine (SVM)**: Using a linear kernel, this model demonstrates the concept of maximizing the margin between different classes, which is key to enhancing model generalization.
3. **Naive Bayes**: This classifier introduces probabilistic modeling, specifically Gaussian Naive Bayes, which assumes that features are conditionally independent given the class and that each feature is normally distributed within a class.

Each classifier is fit on the training data and then predicts the quality scores of the test set. The accuracy of these predictions is calculated and compared, giving a practical sense of how well each method performs on this dataset.
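
To make the instance-based idea behind KNN concrete, here is a minimal sketch in plain Python, independent of scikit-learn. The function name `knn_predict` and the toy points are illustrative assumptions, not part of the lesson's code.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two small clusters labelled 5 and 7 (hypothetical quality scores)
toy_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]
toy_labels = [5, 5, 5, 7, 7, 7]

print(knn_predict(toy_points, toy_labels, (1.1, 1.0)))
print(knn_predict(toy_points, toy_labels, (4.0, 4.0)))
```

There is no training step beyond storing the data: all the work happens at prediction time, which is what "instance-based" refers to.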

In [0]:
# Import the libraries needed for data handling, model building, evaluation, and visualization.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import numpy as np

In [0]:
# Load the Red-Wine-Quality dataset from a CSV file
wine = pd.read_csv('/dbfs/FileStore/CDS2024/winequality_red.csv')

In [0]:
# Red-Wine-Quality DataFrame dimensionality
wine.shape

In [0]:
# Display the first five rows of the Red-Wine-Quality DataFrame to get an initial overview of the data structure and values.
wine.head()

### Data Description:
This notebook builds a red wine quality classification model. The source data consists of two datasets, covering the red and white variants of the Portuguese "Vinho Verde" wine; only the red variant is used here. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, or wine selling price).

These datasets can be treated as classification or regression tasks. The classes are ordered and imbalanced (e.g., there are many more normal wines than excellent or poor ones).

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

Output variable (based on sensory data):

12. quality (**score between 0 and 10**)
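
Because the classes are imbalanced, it can help to look at the label distribution before modelling. The sketch below uses a small made-up list of scores and a hypothetical helper, `class_distribution`; in the notebook, comparable counts would come from `wine['quality'].value_counts()`.

```python
from collections import Counter

def class_distribution(labels):
    """Return the share of each class, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

# Made-up quality scores, for illustration only (not taken from the dataset)
toy_quality = [5, 5, 5, 6, 6, 6, 6, 7, 4, 5]

for label, share in class_distribution(toy_quality).items():
    print(f"quality {label}: {share:.0%}")
```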

In [0]:
# Create a pair plot of the Wine Quality dataset features, differentiating data points by wine quality
# 'vars' specifies the dataset features to include in the plot, excluding the last column
# 'hue' sets the column 'quality' as the category for color coding, allowing visualization of data by quality levels
# 'palette' is set to 'tab10', which offers a distinct set of 10 colors for better visual differentiation
sns.pairplot(wine, vars=wine.columns[:-1], hue="quality", palette='tab10')
# Add a title to the plot for better understanding and presentation
plt.suptitle('Pair Plot of Wine Quality Dataset Features by Quality', size=16, y=1.02)
# Display the plot
plt.show()

In [0]:
# Split the wine dataset into training and testing sets.
# Features (X) are obtained by dropping the 'quality' column, which is the target variable.
# Targets (y) are extracted from the 'quality' column.
# The dataset is split such that 30% of the data is used for testing and 70% for training.
# 'random_state=42' ensures that the split is reproducible, meaning the same random split will occur each time the code is run.
X_train, X_test, y_train, y_test = train_test_split(wine.drop('quality', axis=1), wine['quality'], test_size=0.3, random_state=42)
print("Dimensions of X_train:", X_train.shape)
print("Dimensions of X_test:", X_test.shape)
print("Dimensions of y_train:", y_train.shape)
print("Dimensions of y_test:", y_test.shape)
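
The split above is purely random, so rare quality scores may end up unevenly represented in the training and test sets. One common alternative is a stratified split, which keeps class proportions roughly equal on both sides. The sketch below shows the idea in plain Python; `stratified_split` and the toy labels are illustrative assumptions. In scikit-learn, `train_test_split` accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Split row indices so each class contributes about `test_size` of its rows to the test set."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: three classes with ten rows each (made up)
split_labels = [5] * 10 + [6] * 10 + [7] * 10
train_idx, test_idx = stratified_split(split_labels)

print("Train rows:", len(train_idx), "Test rows:", len(test_idx))
```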

In [0]:
# Initialize the k-Nearest Neighbors classifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Make predictions on the test data
knn_predictions = knn.predict(X_test)
# Calculate and print the accuracy of the model on the test data
knn_accuracy = accuracy_score(y_test, knn_predictions)
print(f"KNN Accuracy: {knn_accuracy:.2f}")
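
KNN relies on distances, so a feature measured on a large numeric scale can dominate the neighbor search. The cell above uses the raw features; the sketch below, with made-up values and hypothetical helper names, illustrates how standardizing each feature (subtracting the mean and dividing by the standard deviation) can change which point counts as nearest. scikit-learn offers `StandardScaler` for this kind of preprocessing.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Return z-scored rows plus the per-feature means and standard deviations."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    scaled = [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]
    return scaled, means, stds

def nearest_index(rows, query):
    """Index of the row closest to `query` in Euclidean distance."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))

# Made-up (total sulfur dioxide, pH) values, for illustration only
scale_rows = [(30.0, 3.0), (34.0, 3.9), (100.0, 3.4)]
scale_query = (33.0, 3.05)

# On raw features, the sulfur dioxide column dominates the distance
raw_nearest = nearest_index(scale_rows, scale_query)

# After standardization, both features contribute on a comparable scale
scaled_rows, means, stds = standardize(scale_rows)
scaled_query = tuple((x - m) / s for x, m, s in zip(scale_query, means, stds))
scaled_nearest = nearest_index(scaled_rows, scaled_query)

print("Nearest on raw features:", raw_nearest)
print("Nearest after standardization:", scaled_nearest)
```

In this toy case the raw distance picks the row with a very different pH, while the standardized distance picks the row that is close on both features.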

In [0]:
# Initialize the Support Vector Machine classifier with a linear kernel
svm = SVC(kernel='linear')
# Fit the classifier to the training data
svm.fit(X_train, y_train)
# Make predictions on the test data
svm_predictions = svm.predict(X_test)
# Calculate and print the accuracy of the model on the test data
svm_accuracy = accuracy_score(y_test, svm_predictions)
print(f"SVM Accuracy: {svm_accuracy:.2f}")
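
For intuition about the margin, consider the two-class case: a linear SVM learns a weight vector `w` and bias `b`, classifies a point by the sign of `w·x + b`, and the margin between the two classes has width `2 / ||w||`. The sketch below uses hypothetical values for `w` and `b` rather than anything learned from the wine data. `SVC` handles the multi-class quality labels by combining several such two-class problems.

```python
import math

def linear_decision(weights, bias, point):
    """Signed score w·x + b; its sign gives the predicted side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def margin_width(weights):
    """Distance between the two margin hyperplanes w·x + b = +1 and w·x + b = -1."""
    return 2 / math.hypot(*weights)

# Hypothetical hyperplane parameters, chosen for easy arithmetic
toy_weights = (3.0, 4.0)
toy_bias = -5.0

print("Score for (2, 1):", linear_decision(toy_weights, toy_bias, (2.0, 1.0)))
print("Score for (0, 0):", linear_decision(toy_weights, toy_bias, (0.0, 0.0)))
print("Margin width:", margin_width(toy_weights))
```

A smaller `||w||` means a wider margin, which is why the training objective penalizes large weights.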

In [0]:
# Initialize the Naive Bayes classifier using the Gaussian distribution
nb = GaussianNB()
# Fit the classifier to the training data
nb.fit(X_train, y_train)
# Make predictions on the test data
nb_predictions = nb.predict(X_test)
# Calculate and print the accuracy of the model on the test data
nb_accuracy = accuracy_score(y_test, nb_predictions)
print(f"Naive Bayes Accuracy: {nb_accuracy:.2f}")
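
To show what the Gaussian and independence assumptions mean in practice, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names, the toy (alcohol, volatile acidity) values, and the small variance constant are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        model[label] = {
            "prior": len(members) / len(rows),
            "means": [mean(col) for col in columns],
            # Small constant (an assumption) avoids division by zero for constant features
            "variances": [pvariance(col) + 1e-9 for col in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        # Independence assumption: per-feature log-likelihoods simply add up
        score = math.log(params["prior"])
        for x, mu, var in zip(row, params["means"], params["variances"]):
            score += log_gaussian(x, mu, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up (alcohol, volatile acidity) rows with hypothetical quality labels
nb_rows = [(9.0, 0.70), (9.4, 0.65), (9.2, 0.72), (11.8, 0.35), (12.1, 0.30), (11.5, 0.38)]
nb_labels = [5, 5, 5, 7, 7, 7]
nb_model = fit_gaussian_nb(nb_rows, nb_labels)

print(predict_gaussian_nb(nb_model, (9.1, 0.68)))
print(predict_gaussian_nb(nb_model, (12.0, 0.33)))
```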

In [0]:
# Compare the accuracy of the three classifiers
print("\nAccuracy Summary:")
print(f"KNN: {knn_accuracy:.2f}, SVM: {svm_accuracy:.2f}, Naive Bayes: {nb_accuracy:.2f}")

In [0]:
# Selecting the best model based on accuracy
# Collect the accuracies by model name; on a tie, the first model listed wins
accuracies = {'KNN': knn_accuracy, 'SVM': svm_accuracy, 'Naive Bayes': nb_accuracy}
best_model = max(accuracies, key=accuracies.get)
best_accuracy = accuracies[best_model]
print(f"Best performing model: {best_model} with accuracy of {best_accuracy:.2f}")
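
Since the quality classes are imbalanced, an accuracy figure is easier to interpret next to a simple baseline: always predicting the most frequent training label. The sketch below uses made-up labels and a hypothetical helper, `majority_baseline_accuracy`; in the notebook, `y_train` and `y_test` would take the place of the toy lists. scikit-learn's `DummyClassifier(strategy="most_frequent")` provides a similar baseline.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)

# Made-up labels, for illustration only
baseline_train = [5, 5, 5, 5, 6, 6, 6, 7]
baseline_test = [5, 6, 5, 7, 5]

baseline_label, baseline_accuracy = majority_baseline_accuracy(baseline_train, baseline_test)
print(f"Always predicting {baseline_label} gives accuracy {baseline_accuracy:.2f}")
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.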

---------------------------------------------------------------------------------------------------------------------------

Author: <b>Julio Iglesias</b>