# Introduction

Wine, an age-old beverage, is produced through the fermentation of grapes and comes in various types, most notably red and white. The differentiation between these types is primarily due to the grape varieties used and the winemaking process, particularly the fermentation method (Jackson, 2020). Red wines are fermented with grape skins, imparting their characteristic color and tannins, while white wines are typically fermented without skins, resulting in a lighter color and flavor profile (Ribéreau-Gayon et al., 2006).

The quality of wine is influenced by a myriad of physicochemical properties such as acidity, sugar level, alcohol content, sulphates, and pH level. These attributes significantly impact the taste, aroma, and overall sensory experience of the wine (Cortez et al., 2009). Understanding these factors and their variations is crucial for winemakers aiming to produce high-quality wines that meet consumer preferences.

## **Objectives**

This project aims to achieve three primary objectives:

1. **To determine the essential physicochemical properties that affect the quality of wine:** By analyzing various physicochemical attributes, we aim to identify which factors most significantly influence wine quality.
2. **To build regression models for predicting wine quality:** The objective is to create and optimize regression models that can accurately predict the numerical quality score of wines using their physicochemical attributes.
3. **To build classification models for predicting wine type:** The objective is to create classification models that can reliably identify red and white wines based on quantitative properties, including acidity, alcohol content, and sugar levels.

## **Purpose of Predictive Models**

The purpose of the predictive models is to provide winemakers and industry professionals with tools to evaluate and predict wine quality based on its physicochemical properties. This can aid in quality control, product development, and market positioning by allowing for the assessment of wine quality before it reaches consumers.

## **Steps to Accomplish Objectives**

1. **Data Collection and Preprocessing:**
   - Combine winequality-red and winequality-white datasets.
   - Clean the data and create a binary column (`type_bin`) to differentiate red (1) and white (0) wines.

2. **Data Visualization and Exploratory Data Analysis (EDA):**
   - **Histograms:** Plot the distribution of each physicochemical variable (fixed acidity, volatile acidity, citric acid, etc.), split by red and white wine type.
   - **Bar Charts:** Illustrate the distribution of wine quality across wine types.
   - **Grouped Bar Plots:** Compare chemical attributes that shape wine taste, such as fixed acidity, alcohol, and residual sugar, across wine types.
   - **Box Plot:** Show the distribution of citric acid across wine types.
   - **Scatter Plots:** Analyze the relationship between selected pairs of variables.
   - **Correlation Heatmap:** Visualize the correlation matrix to identify relationships among the physicochemical properties.

3. **Feature Importance Analysis:**
   - Use Random Forest to determine the importance of physicochemical properties affecting wine quality.

4. **Regression Modeling for Quality Prediction:**
   - **Random Forest Regression:** Train and evaluate the model, and visualize feature importance.
   - **XGBoost Regression:** Train the model and evaluate it using Root Mean Squared Error (RMSE).
   - **Decision Tree:** Train a Decision Tree Classifier on the quality scores and examine its feature importances.

5. **Classification Modeling for Wine Type Prediction:**
   - **Support Vector Machine (SVM):** Train and evaluate the model using metrics like precision, recall, and F1 score.
   - **k-Nearest Neighbors (k-NN):** Train and evaluate the model using the same metrics.
   - **Comparative Analysis:** Compare the performance of the SVM and k-NN models.

6. **Model Evaluation and Comparison:**
   - Create a data frame with the performance metrics of all models.
   - Generate a comparison table to analyze the effectiveness of each model.

## **Libraries Used**

To implement the steps above, we will use the following Python libraries:

- **Pandas:** For data manipulation and preprocessing.
- **NumPy:** For numerical computations.
- **Matplotlib and Seaborn:** For data visualization.
- **Scikit-learn:** For machine learning models and evaluation.
- **XGBoost:** For gradient boosting models.
- **SciPy:** For statistical analysis.

This comprehensive analysis will not only identify key physicochemical properties influencing wine quality but also develop robust predictive models for wine quality and type classification. The insights gained from this study can significantly benefit winemakers in quality control and product development, ultimately enhancing the wine production process.

# 1 Data Preparation

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import confusion_matrix, classification_report, mean_squared_error, mean_absolute_error, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

# Listing Files in a Directory
import os
for dirname, _, filenames in os.walk('/kaggle/input/wine-quality-data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the Data
red_wine = pd.read_csv('/kaggle/input/wine-quality-data/winequality-red.csv', sep=';')
white_wine = pd.read_csv('/kaggle/input/wine-quality-data/winequality-white.csv', sep=';')

# Add Type Bin Columns
red_wine['type_bin'] = 1
white_wine['type_bin'] = 0

# Combine data
wine_data = pd.concat([red_wine, white_wine], axis=0).reset_index(drop=True)

# Display the first few rows of each dataset
print("Red Wine Data:")
print(red_wine.head())
print("\nWhite Wine Data:")
print(white_wine.head())

# 2 Exploratory Data Analysis (EDA)

In [None]:
# Check for missing values
print("Red Wine Missing Values:")
print(red_wine.isnull().sum())
print("\nWhite Wine Missing Values:")
print(white_wine.isnull().sum())

# Get summary statistics
print("Red Wine Summary Statistics:")
print(red_wine.describe())
print("\nWhite Wine Summary Statistics:")
print(white_wine.describe())

# Check unique values in quality for both datasets
print("Red Wine Quality Values:")
print(red_wine['quality'].unique())

print("White Wine Quality Values:")
print(white_wine['quality'].unique())


### Missing Values and Summary Statistics:

No missing values were found.

The summary statistics reveal key differences between red and white wines in their chemical attributes: red wines generally have higher acidity and lower residual sugar than white wines.
The feature importance analysis in Section 3 helps identify which attributes are most influential in determining wine quality.

### Histograms

In [None]:
# Histograms 
features = wine_data.columns[:-2]  # Exclude 'quality' and 'type_bin'
for feature in features:
    plt.figure(figsize=(10, 6))
    sns.histplot(data=wine_data, x=feature, hue='type_bin', kde=True)
    plt.title(f'Histogram of {feature}')
    plt.show()

**Histograms:** The histograms provide a visual distribution of each feature across red and white wines. For instance:

* Fixed Acidity: Red wines generally have higher fixed acidity, contributing to their more robust flavor (Jackson, 2014).
* Alcohol: White wines often have higher alcohol content, which can affect sweetness and body (Robinson, 2019).
* Residual Sugar: White wines exhibit a wider range of residual sugar, often higher than in red wines.

These histograms help identify the differences in the distribution of key attributes between the two types of wine, which may impact their quality and sensory profile.

### Bar Chart of Wine Quality by Type

In [None]:
# Bar chart of wine quality by type
plt.figure(figsize=(10, 6))
sns.countplot(x='quality', hue='type_bin', data=wine_data)
plt.title('Wine Quality Distribution by Type')
plt.show()

### Quality Distribution:

* Red Wine: Quality ratings range from 3 to 8, with a concentration around 5 and 6.

* White Wine: Quality ratings range from 3 to 9, with a concentration around 5, 6, and 7.

The distribution of wine quality reveals that both red and white wines have a mix of quality ratings, with the majority being clustered around the mid-range values.
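
Because the white-wine dataset is larger than the red one (4,898 versus 1,599 samples in the original UCI files), raw counts make the two distributions hard to compare directly. The following is a minimal sketch, using a small hypothetical frame rather than `wine_data`, of tabulating the share of each quality score within each wine type:

```python
import pandas as pd

# Hypothetical toy data standing in for wine_data (values are illustrative only)
toy = pd.DataFrame({
    "quality":  [5, 6, 5, 7, 6, 6, 5, 7],
    "type_bin": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Share of each quality score within each wine type (each row sums to 1)
quality_share = pd.crosstab(toy["type_bin"], toy["quality"], normalize="index")
print(quality_share.round(2))
```

Applied to `wine_data`, the same call would put red and white wines on a common scale regardless of how many samples each type has.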

### Grouped Bar Plot of Selected Features


In [None]:
# Grouped bar plot
group_features = ['fixed acidity', 'alcohol', 'residual sugar']
for feature in group_features:
    plt.figure(figsize=(10, 6))
    sns.barplot(x='type_bin', y=feature, data=wine_data)
    plt.title(f'{feature} by Wine Type')
    plt.show()

* Fixed Acidity: Higher in red wines, contributing to their flavor profile (Jackson, 2014).
* Alcohol: Higher in white wines, affecting sweetness and body (Robinson, 2019).
* Residual Sugar: Generally lower in red wines, influencing taste (Jackson, 2014).

These features are selected due to their significant influence on wine quality and sensory attributes. The grouped bar plot helps visualize these differences.

In [None]:
# Box-plot for citric acid
plt.figure(figsize=(10, 6))
sns.boxplot(x='type_bin', y='citric acid', data=wine_data)
plt.title('Distribution of Citric Acid by Wine Type')
plt.show()

**Box Plot**

The box plot shows the distribution of citric acid, which affects the wine’s acidity and freshness. Citric acid can enhance the balance of flavors, contributing to the overall quality of the wine (Jackson, 2014). The plot indicates that citric acid levels vary between red and white wines, which can impact the sensory profile.

In [None]:
# Scatter plots 
scatter_features = [('fixed acidity', 'pH'), ('alcohol', 'density')]
for x_feature, y_feature in scatter_features:
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=x_feature, y=y_feature, hue='type_bin', data=wine_data)
    plt.title(f'{x_feature} vs {y_feature}')
    plt.show()
    

In [None]:
# Extract and export the data behind each scatter plot
for x_feature, y_feature in scatter_features:
    # Extract data used for plotting
    scatter_data = wine_data[[x_feature, y_feature, 'type_bin']]

    # Print the first few rows for a preview
    print(f"Data for {x_feature} vs {y_feature}:")
    print(scatter_data.head())

    # Save the data to a file for further analysis
    scatter_data.to_csv(f"{x_feature}_vs_{y_feature}_data.csv", index=False)

* Fixed Acidity vs. pH: The scatter plot shows a negative correlation between fixed acidity and pH, indicating that as the fixed acidity of the wine increases, its pH value decreases. This suggests that wines with higher acidity have lower pH levels, making them more acidic. This relationship is important because pH is a key factor in determining the taste and preservation qualities of the wine. A lower pH usually corresponds to a more acidic taste, which is often desired in certain wine styles (Robinson, 2019).

* Alcohol vs. Density: The scatter plot reveals a negative correlation between alcohol and density. This implies that as the alcohol content increases, the density of the wine tends to decrease. The underlying reason for this trend is that alcohol is less dense than water; thus, higher alcohol concentrations reduce the overall density of the wine. This information is valuable for understanding how alcohol levels impact the physical properties of the wine, which can affect its mouthfeel and overall quality (Jackson, 2014).

These relationships are crucial for understanding how certain attributes interact and affect wine quality.

In [None]:
# Correlation Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(wine_data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Wine Attributes')
plt.show()

### Correlation Analysis Insights

####  **Key Correlations:**

**1. Density and Alcohol (-0.687)**

There is a strong negative correlation between `Density` and `Alcohol`. This indicates that as the alcohol content in the wine increases, its density tends to decrease. This is consistent with the fact that alcohol is less dense than water and contributes less to the overall density of the wine. As the alcohol content increases, the relative proportion of denser substances (like sugars) decreases, leading to a lower density overall (Ough & Amerine, 1988).


**2. Chlorides and Volatile Acidity (0.377)**

There is a moderate positive correlation between `Chlorides` and `Volatile Acidity`. This suggests that wines with higher chloride levels tend to have higher volatile acidity. This could be due to the impact of chlorides on the stability of the wine and its tendency to develop higher levels of volatile acids during fermentation or aging (Jackson, 2008).


**3. Residual Sugar and Total Sulfur Dioxide (0.496)**

A moderate positive correlation between `Residual Sugar` and `Total Sulfur Dioxide` indicates that wines with higher residual sugar levels often have higher total sulfur dioxide. This correlation likely reflects the need for higher sulfur dioxide levels to prevent spoilage in sweeter wines, which are more susceptible to microbial activity (Boulton et al., 1996).


**4. Volatile Acidity and Fixed Acidity (0.219)**

There is a weak positive correlation between `Volatile Acidity` and `Fixed Acidity`. This suggests that as volatile acidity increases, fixed acidity tends to increase slightly. This can be due to the fact that both types of acidity are components of the wine's overall acidity profile, though they are influenced by different factors (Jackson, 2008).


**5. Residual Sugar and Density (0.553)**

A moderate positive correlation between `Residual Sugar` and `Density` means that higher levels of residual sugar are associated with higher wine density. This is because sugars contribute to the overall weight of the wine, increasing its density (Ough & Amerine, 1988).


**6. Sulphates and Chlorides (0.396)**

There is a weak to moderate positive correlation between `Sulphates` and `Chlorides`. This suggests that higher sulphate levels in wine are somewhat associated with higher chloride levels. The connection might be due to the overall mineral content of the wine, although the correlation is not very strong (Boulton et al., 1996).


**7. Alcohol and Residual Sugar (-0.359)**

There is a moderate negative correlation between `Alcohol` and `Residual Sugar`. This means that as residual sugar levels increase, alcohol content tends to decrease. This can be explained by the fermentation process where higher sugar levels are often associated with lower final alcohol content due to incomplete fermentation (Ough & Amerine, 1988).

These correlations provide a clearer understanding of how different chemical properties interact in wine, which can be valuable for quality control and wine production processes.
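
Reading coefficients off a heatmap by eye is error-prone, so it can help to list the strongest pairs programmatically. The sketch below is one way to do that; the helper name `strongest_pairs` and the toy data are hypothetical, and with the notebook's data the call would be `strongest_pairs(wine_data)`.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df: pd.DataFrame, top_n: int = 5) -> pd.Series:
    """Return the top_n feature pairs ranked by absolute Pearson correlation."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude but report the signed coefficient
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top_n)

# Hypothetical toy data: 'b' tracks 'a' closely (negatively), 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
top = strongest_pairs(toy, top_n=3)
print(top)
```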

In [None]:
# Create Quality-type variable
wine_data['qualitytype'] = wine_data['quality'].apply(lambda q: 'good' if q >= 6 else 'bad')

# Split data into train and test sets
train, test = train_test_split(wine_data, test_size=0.3, random_state=123)

# 3 Machine Learning Models

## Objective 1: Determine Essential Physicochemical Properties Affecting Wine Quality

## Data Preparation and Feature Engineering

Machine learning models typically require numerical input, so categorical data must be converted to numerical form.
The only categorical column here is 'qualitytype', which we convert with one-hot encoding.

### One-Hot Encoding:

One-hot encoding was applied to the 'qualitytype' column.

In [None]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder

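# One-hot encode the 'qualitytype' column
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse=False` on older versions
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')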
encoded_qualitytype = one_hot_encoder.fit_transform(train[['qualitytype']])
encoded_qualitytype_test = one_hot_encoder.transform(test[['qualitytype']])

# Add one-hot encoded columns back to the dataframe and drop original 'qualitytype' column
encoded_qualitytype_df = pd.DataFrame(encoded_qualitytype, columns=one_hot_encoder.get_feature_names_out(['qualitytype']))
encoded_qualitytype_test_df = pd.DataFrame(encoded_qualitytype_test, columns=one_hot_encoder.get_feature_names_out(['qualitytype']))

train = pd.concat([train.reset_index(drop=True), encoded_qualitytype_df], axis=1).drop(columns=['qualitytype'])
test = pd.concat([test.reset_index(drop=True), encoded_qualitytype_test_df], axis=1).drop(columns=['qualitytype'])


The OneHotEncoder transforms the 'qualitytype' column into a one-hot encoded format.
The encoded column is then concatenated back onto the training and testing DataFrames, and the original 'qualitytype' column is dropped.
Because `drop='first'` is set, the two categories ('bad' and 'good') collapse into a single binary column, `qualitytype_good`, so the data is fully numerical. This matters because the scikit-learn and XGBoost models used below expect numerical inputs.
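
To see what dropping the first level does to a two-level column, the following sketch uses `pandas.get_dummies` with `drop_first=True` as a stand-in for the encoder, on a hypothetical toy series. It is an illustration of the resulting shape, not the notebook's encoding step.

```python
import pandas as pd

# Hypothetical toy labels mirroring the 'good'/'bad' split (quality >= 6 is 'good')
quality = pd.Series([5, 6, 7, 4], name="quality")
qualitytype = quality.apply(lambda q: "good" if q >= 6 else "bad")

# drop_first=True drops the alphabetically first level ('bad'),
# leaving one indicator column that is 1 for 'good' and 0 for 'bad'
encoded = pd.get_dummies(qualitytype, prefix="qualitytype", drop_first=True, dtype=int)
print(encoded)
```

A single indicator column carries the same information as two, which is why one level can be dropped without loss.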

## Model Training and Evaluation

### Random Forest Classifier
We trained a Random Forest Classifier to predict wine quality.

In [None]:
# Train Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=500, random_state=123)
rf_classifier.fit(train.drop(columns=['quality', 'type_bin']), train['quality'])

# Make predictions
predictions_rf = rf_classifier.predict(test.drop(columns=['quality', 'type_bin']))

# Evaluate model
conf_matrix_rf = confusion_matrix(test['quality'], predictions_rf)
print("Confusion Matrix:\n", conf_matrix_rf)

# Feature importance
importance_matrix = rf_classifier.feature_importances_
features = train.drop(columns=['quality', 'type_bin']).columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importance_matrix}).sort_values(by='Importance', ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis')
plt.title('Feature Importance from Random Forest Classifier')
plt.show()

* The confusion matrix shows how well the model classifies each quality category. Classes are numbered here by confusion-matrix row, in ascending order of quality score, so class 1 is the lowest score present. The model distinguishes most categories well but struggles with some, especially the lower-quality ones.
    - Classes 3 and 4 are predicted well, with a high number of correct predictions.
    - Class 1 is poorly predicted, suggesting that the model struggles to identify this class.
    - Class 2 also has a high number of misclassifications, most often being mistaken for class 3.
    - Class 5 shows significant confusion with class 4.
    - Classes 6 and 7 have limited data, which might contribute to their poor performance.

* The feature importance analysis shows which features are most influential:
    - **Top Features:** Features with the highest importance scores are most influential in predicting wine quality. These features should be carefully analyzed for their impact on wine characteristics.
    - **Impact on Wine Quality:** Understanding feature importance helps identify which factors most affect wine quality, guiding improvements in wine production and quality control.
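
One point to check when reading these importances: in the cell above, the predictors are every column except `quality` and `type_bin`, so the `qualitytype_good` column, which is derived directly from `quality`, is still among them and can dominate the ranking. The following is a minimal sketch on hypothetical synthetic data (the column values and the `derived_cols` list are assumptions for illustration) of excluding target-derived columns before fitting:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data: quality depends loosely on alcohol only
rng = np.random.default_rng(123)
n = 300
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "pH": rng.normal(3.2, 0.15, n),
})
raw_score = toy["alcohol"] - 4.5 + rng.normal(0, 0.5, n)
toy["quality"] = np.clip(np.round(raw_score), 3, 9).astype(int)
toy["qualitytype_good"] = (toy["quality"] >= 6).astype(int)  # derived from the target

# Columns computed from the target should not be used as predictors
derived_cols = ["qualitytype_good"]
X = toy.drop(columns=["quality"] + derived_cols)
y = toy["quality"]

model = RandomForestClassifier(n_estimators=50, random_state=123).fit(X, y)
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

With the derived column removed, the ranking reflects only the physicochemical predictors, which is what Objective 1 asks about.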

## Objective 2: Build Regression Models for Predicting Wine Quality

###  Random Forest Regressor

For predicting the exact quality score, we employed a Random Forest Regressor. The RMSE value helps evaluate how well the model predicts actual quality scores.

In [None]:
# Random Forest Regression 
# Ensure 'quality' column is numeric
wine_data['quality'] = wine_data['quality'].astype(float)

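# One-hot encode the 'qualitytype' column
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse=False` on older versions
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')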
encoded_qualitytype = one_hot_encoder.fit_transform(wine_data[['qualitytype']])

# Add one-hot encoded columns back to the dataframe and drop original 'qualitytype' column
encoded_qualitytype_df = pd.DataFrame(encoded_qualitytype, columns=one_hot_encoder.get_feature_names_out(['qualitytype']))
wine_data = pd.concat([wine_data.reset_index(drop=True), encoded_qualitytype_df], axis=1).drop(columns=['qualitytype'])

# Split the data into training and testing sets
train_set, test_set = train_test_split(wine_data, test_size=0.3, random_state=123)

# Train Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=500, random_state=123)
rf_regressor.fit(train_set.drop(columns=['quality', 'type_bin']), train_set['quality'])

# Make predictions
predictions_rf_reg = rf_regressor.predict(test_set.drop(columns=['quality', 'type_bin']))

# Calculate RMSE
rmse_rf_reg = np.sqrt(mean_squared_error(test_set['quality'], predictions_rf_reg))
print("Random Forest Regression RMSE:", rmse_rf_reg)

# Feature importance
importance_matrix = rf_regressor.feature_importances_
features = train_set.drop(columns=['quality', 'type_bin']).columns
importance_df = pd.DataFrame({'Feature': features, 'Importance': importance_matrix}).sort_values(by='Importance', ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=importance_df, palette='viridis')
plt.title('Feature Importance from Random Forest Regressor')
plt.show()

**Root Mean Squared Error (RMSE):**

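**RMSE = 0.4165:** RMSE is the square root of the mean squared difference between predicted and actual quality scores, so it is expressed in the same units as the quality score and penalizes large errors more heavily than small ones. A lower RMSE indicates better predictive accuracy, which can be critical for quality assessment and improvement.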
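
The introduction lists MSE and MAE alongside RMSE. As a minimal sketch of how the three relate, here they are computed by hand on hypothetical scores (the values are illustrative, not model output):

```python
import numpy as np

# Hypothetical actual and predicted quality scores (illustrative values only)
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

errors = y_pred - y_true
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in quality-score units
mae = np.mean(np.abs(errors))   # mean absolute error

print(f"MSE={mse:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
```

RMSE is never smaller than MAE on the same predictions; a large gap between the two points to a few large errors rather than many small ones.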

### XGBoost Regression
We also trained an XGBoost model for regression, using trees of maximum depth 3 and 100 boosting rounds.

In [None]:
# XGBoost Regression (Objective 2)
# Define predictors and responses
train_x = train_set.drop(columns=['quality', 'type_bin'])
train_y = train_set['quality']
test_x = test_set.drop(columns=['quality', 'type_bin'])
test_y = test_set['quality']

# Create DMatrix objects
train_xgb = xgb.DMatrix(data=train_x, label=train_y)
test_xgb = xgb.DMatrix(data=test_x, label=test_y)

# Train XGBoost model
xgb_model = xgb.train(params={'max_depth': 3, 'objective': 'reg:squarederror'}, dtrain=train_xgb, num_boost_round=100)

# Make predictions
pred_xgb = xgb_model.predict(test_xgb)

# Calculate RMSE
rmse_xgb = np.sqrt(mean_squared_error(test_y, pred_xgb))
print("XGBoost Regression RMSE:", rmse_xgb)

# Convert predictions and actual values to categories
pred_xgb_cat = pd.cut(pred_xgb, bins=[0, 4.5, 6.5, 10], labels=['Low', 'Medium', 'High'])
test_y_cat = pd.cut(test_y, bins=[0, 4.5, 6.5, 10], labels=['Low', 'Medium', 'High'])

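# Create confusion matrix (pin the label order; by default string labels are sorted alphabetically)
conf_matrix_xgb = confusion_matrix(test_y_cat, pred_xgb_cat, labels=['Low', 'Medium', 'High'])
print("XGBoost Confusion Matrix:\n", conf_matrix_xgb)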


**RMSE Comparison:**

**RMSE = 0.4558:** XGBoost has a slightly higher RMSE compared to the Random Forest Regressor. This indicates that while XGBoost performs well, it may not be as accurate as Random Forest in predicting exact quality scores.

**Confusion Matrix**

We also binned the predicted and actual scores into Low (up to 4.5), Medium (above 4.5 and up to 6.5), and High (above 6.5) categories and built a confusion matrix from them. The model performs well in predicting high quality (Class 3) but struggles more with low-quality predictions, which suggests room for improvement at the low end of the scale.
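
Row order matters when reading class numbers off this matrix: `confusion_matrix` sorts string labels alphabetically unless an explicit order is passed through its `labels` parameter. The following is a minimal sketch on hypothetical toy labels showing both orderings:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (illustrative only)
y_true = ["Low", "Medium", "High", "Medium"]
y_pred = ["Low", "Medium", "Medium", "Medium"]

# Without `labels`, rows and columns follow sorted order: High, Low, Medium
cm_default = confusion_matrix(y_true, y_pred)

# With `labels`, rows and columns follow the order given: Low, Medium, High
cm_ordered = confusion_matrix(y_true, y_pred, labels=["Low", "Medium", "High"])

print("Default order (High, Low, Medium):\n", cm_default)
print("Explicit order (Low, Medium, High):\n", cm_ordered)
```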

### Decision Tree Classification
We trained a Decision Tree Classifier on the quality scores to see how well it performs. Decision trees are easy to interpret: their splits and feature importances show which attributes drive the predictions, which can guide quality control measures.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Ensure 'quality' column is treated as a categorical variable
wine_data['quality'] = wine_data['quality'].astype('category')

# Split the data into training and testing sets
train_set, test_set = train_test_split(wine_data, test_size=0.3, random_state=123)

# Separate predictors and target once so the same columns are used throughout
X_train = train_set.drop(columns=['quality', 'type_bin'])
y_train = train_set['quality']

# Train the Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=123)
dt_classifier.fit(X_train, y_train)

# Get feature names and class names
feature_names = X_train.columns
class_names = [str(cat) for cat in y_train.cat.categories]

# Get feature importances, sorted from most to least important
importances = dt_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(12, 8))
plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), np.array(feature_names)[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()



**Feature Importance:** Similar to Random Forest, the Decision Tree provides insights into which features are most impactful. This helps in understanding the key factors affecting wine quality.
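
The notebook imports `plot_tree`, which draws a fitted tree itself. A full-depth tree on this dataset is usually too large to read, so one option is to fit a shallow tree purely for visualization. The following is a minimal sketch on hypothetical synthetic data; the feature names, class names, and `max_depth=3` are illustrative assumptions, not the notebook's settings.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Hypothetical synthetic data with three classes and four features
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=123)
feature_names = ["alcohol", "volatile acidity", "sulphates", "density"]  # illustrative names

# A shallow tree keeps the diagram readable
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X, y)

# plot_tree returns the list of drawn annotations, one per node
fig, ax = plt.subplots(figsize=(14, 6))
artists = plot_tree(shallow_tree, feature_names=feature_names,
                    class_names=["low", "mid", "high"], filled=True, ax=ax)
```

Limiting the depth trades accuracy for readability, so a tree drawn this way is best treated as an explanatory aid rather than the evaluated model.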

## Objective 3: Build Classification Models for Predicting Wine Type

### Support Vector Machine (SVM) Classification
An SVM with a linear kernel was used to classify wines as red or white.

In [None]:
# Support Vector Machine (SVM) Classification

from sklearn.svm import SVC

# Train SVM Classifier
svm_classifier = SVC(kernel='linear', random_state=123)
svm_classifier.fit(train.drop(columns=['type_bin', 'quality']), train['type_bin'])

# Make predictions
predictions_svm = svm_classifier.predict(test.drop(columns=['type_bin', 'quality']))

# Evaluate model
conf_matrix_svm = confusion_matrix(test['type_bin'], predictions_svm)
accuracy_svm = accuracy_score(test['type_bin'], predictions_svm)
precision_svm = precision_score(test['type_bin'], predictions_svm)
recall_svm = recall_score(test['type_bin'], predictions_svm)
f1_svm = f1_score(test['type_bin'], predictions_svm)

print("SVM Confusion Matrix:\n", conf_matrix_svm)
print(f"SVM Accuracy: {accuracy_svm}, Precision: {precision_svm}, Recall: {recall_svm}, F1 Score: {f1_svm}")


**Metrics:**

- Accuracy: 0.9841
- Precision: 0.9792
- Recall: 0.9573
- F1 Score: 0.9681


**Analysis**

SVM demonstrates high accuracy, precision, recall, and F1 score, indicating its effectiveness in classifying wine types accurately.

### k-Nearest Neighbors (k-NN) Classification
We also trained a k-NN classifier with k = 5 neighbors.

In [None]:
# k-Nearest Neighbors (k-NN) Classification
from sklearn.neighbors import KNeighborsClassifier

# Train k-NN Classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(train.drop(columns=['type_bin', 'quality']), train['type_bin'])

# Make predictions
predictions_knn = knn_classifier.predict(test.drop(columns=['type_bin', 'quality']))

# Evaluate model
conf_matrix_knn = confusion_matrix(test['type_bin'], predictions_knn)
accuracy_knn = accuracy_score(test['type_bin'], predictions_knn)
precision_knn = precision_score(test['type_bin'], predictions_knn)
recall_knn = recall_score(test['type_bin'], predictions_knn)
f1_knn = f1_score(test['type_bin'], predictions_knn)

print("k-NN Confusion Matrix:\n", conf_matrix_knn)
print(f"k-NN Accuracy: {accuracy_knn}, Precision: {precision_knn}, Recall: {recall_knn}, F1 Score: {f1_knn}")


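**Metrics:**

- Accuracy: 0.9390
- Precision: 0.9307
- Recall: 0.8191
- F1 Score: 0.8714


**Analysis**

k-NN is accurate but weaker than SVM on every metric. Its recall (0.8191) is well below its precision (0.9307), so its main weakness is false negatives: red wines (the positive class, `type_bin` = 1) misclassified as white. It is useful but may not be as effective in scenarios where missed positives are costly.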
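
k-NN relies on distances between samples, so features on large numeric scales (total sulfur dioxide, for example) can outweigh features on small scales (such as density). The notebook imports `StandardScaler` but fits k-NN on unscaled features. The following is a minimal sketch on hypothetical synthetic data of how scaling inside a pipeline can change k-NN results; it does not reproduce the metrics reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one small-scale informative feature, one large-scale noise feature
rng = np.random.default_rng(123)
y = rng.integers(0, 2, size=400)
informative = y + rng.normal(0, 0.3, size=400)
noise = rng.normal(0, 1000, size=400)
X = np.column_stack([informative, noise])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Unscaled k-NN: distances are dominated by the large-scale noise feature
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Scaled k-NN: the scaler is fit on the training data only, then applied to the test data
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

unscaled_acc = unscaled.score(X_test, y_test)
scaled_acc = scaled.score(X_test, y_test)
print(f"unscaled accuracy={unscaled_acc:.3f}, scaled accuracy={scaled_acc:.3f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fit, which avoids a subtle form of leakage.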

### Model Comparison
Finally, we compared the performance metrics of SVM and k-NN.

In [None]:
# Create a DataFrame with performance metrics for SVM and k-NN
model_comparison_classification = pd.DataFrame({
    'Model': ['SVM', 'k-NN'],
    'Accuracy': [accuracy_svm, accuracy_knn],
    'Precision': [precision_svm, precision_knn],
    'Recall': [recall_svm, recall_knn],
    'F1 Score': [f1_svm, f1_knn]
})

# Display the comparison table
print(model_comparison_classification)


**SVM vs. k-NN:** SVM outperforms k-NN on all four metrics, making it the preferred model for classifying wine type.
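
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this with the values reported above; the helper name `f1_from` is hypothetical. Since precision is TP / (TP + FP) and recall is TP / (TP + FN), a recall below precision also implies more false negatives than false positives.

```python
# Precision, recall, and F1 as reported above
reported = {
    "SVM":  {"precision": 0.9792, "recall": 0.9573, "f1": 0.9681},
    "k-NN": {"precision": 0.9307, "recall": 0.8191, "f1": 0.8714},
}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for model, m in reported.items():
    recomputed = f1_from(m["precision"], m["recall"])
    # Inputs are rounded to four decimals, so small differences are expected
    print(f"{model}: recomputed F1={recomputed:.4f}, reported F1={m['f1']:.4f}")
```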

### **Conclusion**

#### **Summary of Findings**

1. **Exploratory Data Analysis (EDA)**
   - The EDA reveals notable differences between red and white wines regarding acidity, alcohol content, and residual sugar. Red wines generally exhibit higher acidity and lower residual sugar, contributing to their bold flavor, while white wines have higher alcohol content and greater sweetness variability.
   - The quality distribution shows that red wines are predominantly mid-range in quality, whereas white wines span a broader quality range. Specific features such as citric acid, as well as relationships observed through scatter plots and correlation heatmaps, provide valuable insights into how these attributes influence wine quality.
   - These findings emphasize the importance of key features in determining wine quality and can guide both descriptive and predictive analyses for better understanding and improving wine characteristics.


2. **Machine Learning Models**
   - **Random Forest Classifier**
     - **Confusion Matrix Insights:**
       - Struggles with very low-quality wines (Class 1) and occasionally misclassifies lower-quality wines (Class 2).
       - Performs well with mid-range (Class 3) and higher-quality wines (Class 4), but shows some confusion between high-quality (Class 4) and very high-quality wines (Class 5).
       - Limited data and poor performance with Class 6 and Class 7 may reflect challenges in distinguishing the highest quality wines.
     - **Feature Importance:** Identifies crucial features for predicting wine quality, helping to focus on factors that significantly impact wine characteristics.


   - **Random Forest Regressor**
     - **RMSE: 0.4165** indicates good accuracy in predicting exact quality scores, suggesting that the model effectively captures the quality nuances of wines.


   - **XGBoost Regression**
     - **RMSE: 0.4558** shows slightly less accuracy compared to Random Forest but remains a strong model. Further tuning could enhance performance.


   - **Decision Tree Classifier**
     - Provides intuitive insights into feature importance and decision-making processes, aiding in the interpretation of how different features affect wine quality.


   - **Support Vector Machine (SVM) Classification**
     - **Metrics:**
       - Accuracy: 0.9841
       - Precision: 0.9792
       - Recall: 0.9573
       - F1 Score: 0.9681
     - SVM demonstrates high performance across all metrics, making it effective at distinguishing red from white wines.


   - **k-Nearest Neighbors (k-NN) Classification**
     - **Metrics:**
       - Accuracy: 0.9390
       - Precision: 0.9307
       - Recall: 0.8191
       - F1 Score: 0.8714
     - While accurate, k-NN trails SVM on every metric; its lower recall indicates that it misses more red wines (false negatives). It remains useful but might not be ideal where missed positives are costly.


   - **Predictive Performance:** Random Forest is recommended for predicting wine quality and SVM for classifying wine type. XGBoost is also effective for quality prediction but slightly less accurate than Random Forest.
   - **Feature Analysis:** Key features impacting wine quality have been identified, guiding improvement efforts and focusing on influential attributes.
   - **Class-Specific Insights:** Models highlight difficulties in predicting extreme quality classes, suggesting a need for more data or improved features to differentiate these classes better.
   - **Model Suitability:** SVM is preferred for high accuracy and precision tasks, while Random Forest and Decision Trees offer valuable interpretability. k-NN is useful but may not be the best choice for high-precision needs.


#### **Practical Implications**

- **Wine Production and Quality Control:** Insights from the analysis can guide producers in optimizing their winemaking processes by focusing on key attributes such as fixed acidity, residual sugar, and alcohol content. Adjustments in these areas can enhance wine quality.
- **Feature Importance for Quality Improvement:** Understanding the most impactful attributes enables targeted efforts in improving wine quality, such as adjusting acidity and sugar levels to meet desired quality standards.
- **Model Selection for Quality Prediction:** Random Forest and XGBoost models are recommended for predicting wine quality due to their accuracy. These models can support quality assessment and improvement strategies.

#### **Suggested Future Work**

- **Deeper Feature Analysis:** Future research could involve a more detailed analysis of specific features and their interactions to refine the understanding of their impact on wine quality.
- **Additional Models and Techniques:** Exploring advanced machine learning techniques and models may further enhance prediction accuracy and provide additional insights.
- **Application in the Wine Industry:** Implementing these findings in practical settings, such as quality control and production optimization, can lead to improved wine quality and more targeted production practices.


# **References**

1. Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. *Decision Support Systems, 47*(4), 547-553. doi:10.1016/j.dss.2009.05.016
2. Jackson, R. S. (2020). *Wine Science: Principles and Applications* (5th ed.). Academic Press.
3. Kennedy, J. A. (2008). Grape and wine phenolics: Observations and recent findings. *Ciencia e Investigación Agraria, 35*(3), 107-120. doi:10.4067/S0718-16202008000300001
4. Ribéreau-Gayon, P., Glories, Y., Maujean, A., & Dubourdieu, D. (2006). *Handbook of Enology, Volume 2: The Chemistry of Wine - Stabilization and Treatments* (2nd ed.). John Wiley & Sons, Ltd.
5. Breiman, L. (2001). Random forests. *Machine Learning, 45*(1), 5-32. doi:10.1023/A:1010933404324
6. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (pp. 785-794). doi:10.1145/2939672.2939785
7. Quinlan, J. R. (1986). Induction of decision trees. *Machine Learning, 1*(1), 81-106. doi:10.1023/A:1022643204877
8. Cortes, C., & Vapnik, V. (1995). Support-vector networks. *Machine Learning, 20*(3), 273-297. doi:10.1007/BF00994018
9. Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. *IEEE Transactions on Information Theory, 13*(1), 21-27. doi:10.1109/TIT.1967.1053964
10. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. doi:10.1007/978-0-387-84858-7

