# Data Exploration
## Introduction
This study presents a comprehensive exploratory data analysis of the Wine Quality datasets, which constitute the empirical foundation for the present investigation. The analysis encompasses both red and white wine variants from the Portuguese "Vinho Verde" collection, with the objective of elucidating the underlying patterns and physicochemical characteristics that determine wine quality assessments.

The analysis commences with the requisite library imports and the configuration of graphical parameters to ensure consistent, publication-standard visualizations throughout the investigative process.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

COLOR_RED = "#720026"
COLOR_WHITE = "#DBDD46"

The first dataset (red wine) is loaded and its features, data types, and overall structure are inspected.

In [None]:
red_wine = pd.read_csv('../data/winequality-red.csv', sep=';')
red_wine.head()

The second dataset (white wine) is then examined using the same approach.

In [None]:
white_wine = pd.read_csv('../data/winequality-white.csv', sep=';')
white_wine.head()

Both datasets contain exclusively *numerical* data (floating-point physicochemical measurements plus an integer `quality` score), and they share **identical feature sets**.
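
The observation above can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming two frames shaped like `red_wine` and `white_wine`; the helper `compare_schemas` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def compare_schemas(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Summarize whether two frames share the same columns and dtypes."""
    same_columns = list(left.columns) == list(right.columns)
    shared = [c for c in left.columns if c in right.columns]
    same_dtypes = all(left[c].dtype == right[c].dtype for c in shared)
    non_numeric = [c for c in shared
                   if not pd.api.types.is_numeric_dtype(left[c])]
    return {
        "same_columns": same_columns,
        "same_dtypes": same_dtypes,
        "non_numeric": non_numeric,
    }

# Toy stand-ins for red_wine / white_wine (values are made up for illustration).
toy_red = pd.DataFrame({"alcohol": [9.4, 9.8], "pH": [3.51, 3.20], "quality": [5, 6]})
toy_white = pd.DataFrame({"alcohol": [8.8, 10.1], "pH": [3.00, 3.26], "quality": [6, 7]})

schema_report = compare_schemas(toy_red, toy_white)
print(schema_report)
```

On the real data, the call would presumably be `compare_schemas(red_wine, white_wine)`, and an empty `non_numeric` list would confirm that every column is numerical.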

We will now examine additional dataset characteristics, including sample size, and compare statistical measures such as mean, standard deviation, and median across both datasets.

In [None]:
red_wine.describe()

The same summary is computed for the white wine dataset.

In [None]:
white_wine.describe()

The first observation is a substantial imbalance between the datasets: the red wine dataset comprises 1,599 samples while the white wine dataset contains 4,898, a ratio of roughly 1:3, so the combined dataset is inherently unbalanced with respect to wine type. Furthermore, most features show distinct summary statistics (mean, standard deviation, and median) across the two wine types, suggesting that red and white wines differ in their physicochemical profiles. Alcohol content is the most notable exception, with closely matching statistics in both datasets.
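
The imbalance can be made concrete with a short calculation, and the per-feature comparison is easier to read when the two summaries sit side by side. This is a minimal sketch: the sample counts are the ones reported above, while the helper `summary_side_by_side` and the toy frames are hypothetical illustrations.

```python
import pandas as pd

# Sample counts reported by describe() above.
n_red, n_white = 1599, 4898
n_total = n_red + n_white
red_share = n_red / n_total * 100
white_share = n_white / n_total * 100
print(f"red: {red_share:.1f}%  white: {white_share:.1f}%  "
      f"white/red ratio: {n_white / n_red:.2f}")
# prints: red: 24.6%  white: 75.4%  white/red ratio: 3.06

def summary_side_by_side(red, white, stats=("mean", "std", "50%")):
    """Place selected describe() rows of both frames next to each other."""
    red_stats = red.describe().loc[list(stats)].T
    white_stats = white.describe().loc[list(stats)].T
    # Dict keys become the outer column level: ('red', 'mean'), ('white', 'mean'), ...
    return pd.concat({"red": red_stats, "white": white_stats}, axis=1)

# Toy frames standing in for the real datasets (values are made up).
toy_red = pd.DataFrame({"alcohol": [9.0, 10.0, 11.0], "pH": [3.2, 3.3, 3.4]})
toy_white = pd.DataFrame({"alcohol": [10.0, 11.0, 12.0], "pH": [3.0, 3.1, 3.2]})
comparison = summary_side_by_side(toy_red, toy_white)
print(comparison)
```

Here `50%` is the label that `describe()` gives to the median row.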

Let's now examine the datasets for any inconsistencies or data quality issues.

In [None]:
red_wine.isnull().sum()

In [None]:
white_wine.isnull().sum()

Fortunately, there are no missing values in either dataset.
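
Missing values are only one kind of data quality issue. As a hedged extension, the sketch below also counts exact duplicate rows; this check is an assumption about what might be worth inspecting, not something the original analysis performs, and `quality_report` is a hypothetical helper.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.Series:
    """Count rows, missing cells, and exact duplicate rows in a frame."""
    return pd.Series({
        "rows": len(df),
        "missing_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    })

# Toy frame with one missing cell and one duplicated row (values are made up).
toy = pd.DataFrame({
    "pH": [3.2, 3.2, None],
    "alcohol": [9.5, 9.5, 10.0],
})
report = quality_report(toy)
print(report)
```

Whether duplicate rows should be dropped depends on whether they represent repeated measurements or genuinely identical wines, which the data alone cannot settle.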

To simplify subsequent data manipulation, the two datasets are concatenated, and a categorical column, `wine_type`, is added to record the wine color.

In [None]:
red_wine['wine_type'] = 'red'
white_wine['wine_type'] = 'white'
wine_data = pd.concat([red_wine, white_wine], axis=0, ignore_index=True)
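
A quick sanity check after a concatenation like this one can catch silent mistakes such as dropped rows or a duplicated index. The following is a minimal sketch on toy frames (made-up values) that mirrors the steps above.

```python
import pandas as pd

# Toy stand-ins for red_wine / white_wine (values are made up for illustration).
toy_red = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
toy_white = pd.DataFrame({"alcohol": [8.8, 10.1, 11.0], "quality": [6, 7, 6]})
toy_red["wine_type"] = "red"
toy_white["wine_type"] = "white"
toy_all = pd.concat([toy_red, toy_white], axis=0, ignore_index=True)

# Row count must equal the sum of the parts, and ignore_index=True
# should yield a fresh, unique index.
assert len(toy_all) == len(toy_red) + len(toy_white)
assert toy_all.index.is_unique

type_counts = toy_all["wine_type"].value_counts()
print(type_counts)
```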

Next, the quality distributions are plotted. Because the two datasets differ in size, quality counts are normalized within each dataset and plotted as percentages, so that the comparison is not distorted by sample size. A "balanced average", the unweighted mean of the two percentage distributions, is shown alongside them.

In [None]:
red_quality_counts = red_wine['quality'].value_counts(normalize=True).sort_index() * 100
white_quality_counts = white_wine['quality'].value_counts(normalize=True).sort_index() * 100

all_quality_levels = sorted(set(red_quality_counts.index) | set(white_quality_counts.index))

red_quality_aligned = red_quality_counts.reindex(all_quality_levels, fill_value=0)
white_quality_aligned = white_quality_counts.reindex(all_quality_levels, fill_value=0)
balanced_quality_counts = (red_quality_aligned + white_quality_aligned) / 2

fig, ax = plt.subplots(figsize=(12, 6))

width = 0.25
x = all_quality_levels

bars1 = ax.bar([i - width for i in x], red_quality_aligned.values, width, label='Red Wine', color=COLOR_RED, alpha=0.7, edgecolor='black')
bars2 = ax.bar([i for i in x], white_quality_aligned.values, width, label='White Wine', color=COLOR_WHITE, alpha=0.7, edgecolor='black')
bars3 = ax.bar([i + width for i in x], balanced_quality_counts.values, width, label='Balanced Average', color='gray', alpha=0.7, edgecolor='black')

ax.set_title('Quality Distribution Comparison - Red vs White vs Balanced Average (%)')
ax.set_xlabel('Quality')
ax.set_ylabel('Percentage (%)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xticks(x)

plt.tight_layout()
plt.show()

The plot suggests that white wines tend to receive higher quality ratings than red wines.
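
A visual impression like this can be backed by a few summary numbers. The sketch below is one possible way to do so; `quality_summary` is a hypothetical helper, the toy scores are made up, and the threshold of 6 matches the good/bad cut-off introduced in the next step.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame, threshold: int = 6) -> pd.Series:
    """Mean, median, and share of scores at or above a threshold."""
    q = df["quality"]
    return pd.Series({
        "mean_quality": q.mean(),
        "median_quality": q.median(),
        "pct_at_or_above_threshold": (q >= threshold).mean() * 100,
    })

# Toy scores standing in for the real quality columns.
toy_red = pd.DataFrame({"quality": [5, 5, 6, 7]})
toy_white = pd.DataFrame({"quality": [5, 6, 6, 7]})

comparison = pd.DataFrame({
    "red": quality_summary(toy_red),
    "white": quality_summary(toy_white),
})
print(comparison)
```

On the real data, the call would presumably be `quality_summary(red_wine)` and `quality_summary(white_wine)`.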

The quality variable is now encoded as a new binary feature, following the experimental specification: scores ≥ 6 are labelled *good* (1) and scores < 6 are labelled *bad* (0).

In [None]:
wine_data['quality_binary'] = (wine_data['quality'] >= 6).astype(int)
red_wine['quality_binary'] = (red_wine['quality'] >= 6).astype(int)
white_wine['quality_binary'] = (white_wine['quality'] >= 6).astype(int)

The resulting binary classes are then plotted for both wine types.

In [None]:
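def get_quality_percentages(data):
    # Percentage of each class; reindex keeps both classes (0 = bad, 1 = good)
    # in a fixed order even if one of them is absent from the data
    return data['quality_binary'].value_counts(normalize=True).reindex([0, 1], fill_value=0) * 100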

red_percentages = get_quality_percentages(red_wine)
white_percentages = get_quality_percentages(white_wine)

fig, ax = plt.subplots(figsize=(8, 6))

width = 0.35
x = [0, 1]  # 0: 'Bad' and 1: 'Good'

bars1 = ax.bar([i - width/2 for i in x], red_percentages.values, width, label='Red Wine', color=COLOR_RED, alpha=0.7, edgecolor='black')
bars2 = ax.bar([i + width/2 for i in x], white_percentages.values, width, label='White Wine', color=COLOR_WHITE, alpha=0.7, edgecolor='black')

ax.set_title('Quality Distribution Comparison - Red vs White Wine')
ax.set_xlabel('Wine Quality Category')
ax.set_ylabel('Percentage (%)')
ax.set_xticks(x)
ax.set_xticklabels(['Bad (< 6)', 'Good (≥ 6)'])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
ax.set_ylim(0, 100)

plt.tight_layout()
plt.show()

Let's take a closer look at how the feature distributions compare between the two wine types.

In [None]:
feature_cols = [col for col in wine_data.columns 
                if col not in ['quality', 'quality_binary', 'wine_type']]

fig, axes = plt.subplots(4, 3, figsize=(15, 16))
axes = axes.ravel()

for i, col in enumerate(feature_cols):
    axes[i].hist([red_wine[col], white_wine[col]], 
                   bins=30, alpha=0.7, label=['Red Wine', 'White Wine'],
                   color=[COLOR_RED, COLOR_WHITE],
                   edgecolor='black', density=True)
    axes[i].set_title(f'Distribution: {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Density')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

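# Hide the unused panel(s): the 4x3 grid has more axes than there are features
for ax in axes[len(feature_cols):]:
    ax.set_visible(False)

plt.tight_layout()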
plt.show()

The histograms reveal substantial chemical differences between red and white wines. White wines contain noticeably more residual sugar and nearly three times as much sulfur dioxide. Red wines, in contrast, show higher fixed and volatile acidity, along with elevated chlorides, sulphates, and density.
Red wines also show lower variability across most chemical properties, which may reflect more consistent production processes, while white wines show greater variance. Alcohol content remains nearly identical between the wine types, as noted above. Several features, such as chlorides and sulphates, have markedly asymmetric distributions.
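
The claims about level, spread, and asymmetry can be quantified per wine type with the mean, standard deviation, and skewness of each feature. This is a minimal sketch, assuming a combined frame with a `wine_type` column like `wine_data`; `distribution_summary` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def distribution_summary(df, feature_cols, group_col="wine_type"):
    """Mean, standard deviation, and skewness of each feature per group."""
    per_group = {}
    for name, group in df.groupby(group_col):
        # Rows: features; columns: the three statistics.
        per_group[name] = group[feature_cols].agg(["mean", "std", "skew"]).T
    # Outer column level is the group name, inner level is the statistic.
    return pd.concat(per_group, axis=1)

# Toy data: the "red" chlorides values have a long right tail,
# the "white" values are symmetric.
toy = pd.DataFrame({
    "chlorides": [0.05, 0.06, 0.07, 0.30, 0.03, 0.04, 0.05, 0.06],
    "wine_type": ["red"] * 4 + ["white"] * 4,
})
summary = distribution_summary(toy, ["chlorides"])
print(summary)
```

A positive skew value indicates a longer right tail, and comparing the `std` columns gives a direct check on which wine type varies more for a given feature.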

Boxplots will now be generated to identify potential outliers within the dataset.

In [None]:
fig, axes = plt.subplots(4, 3, figsize=(15, 16))
axes = axes.ravel()

for i, col in enumerate(feature_cols):
    data_to_plot = [red_wine[col], white_wine[col]]
    # Note: tick_labels requires Matplotlib >= 3.9; older versions use labels= instead
    bp = axes[i].boxplot(data_to_plot, tick_labels=['Red Wine', 'White Wine'], patch_artist=True)
    
    colors = [COLOR_RED, COLOR_WHITE]
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
    
    axes[i].set_title(f'Box Plot: {col}')
    axes[i].set_ylabel(col)
    axes[i].grid(True, alpha=0.3)

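# Hide the unused panel(s): the 4x3 grid has more axes than there are features
for ax in axes[len(feature_cols):]:
    ax.set_visible(False)

plt.tight_layout()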
plt.show()

The box plots show numerous outliers in most features, so an appropriate outlier-handling strategy will be needed.
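
To put numbers on what the box plots show, outliers can be counted with the same 1.5 × IQR rule that Matplotlib's default whiskers use (`whis=1.5`). The following is a minimal sketch; the helpers `iqr_outlier_mask` and `outlier_counts` are hypothetical, the toy values are made up, and the rule is only one of several possible outlier definitions.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def outlier_counts(df: pd.DataFrame, feature_cols, k: float = 1.5) -> pd.Series:
    """Number of IQR outliers per feature."""
    return pd.Series({col: int(iqr_outlier_mask(df[col], k).sum())
                      for col in feature_cols})

# Toy data: one extreme chlorides value, no extreme pH values.
toy = pd.DataFrame({
    "chlorides": [0.04, 0.05, 0.05, 0.06, 0.06, 0.07, 0.60],
    "pH": [3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.5],
})
counts = outlier_counts(toy, ["chlorides", "pH"])
print(counts)
```

On the real data, the call would presumably be `outlier_counts(red_wine, feature_cols)` and `outlier_counts(white_wine, feature_cols)`, computed separately because the two wine types have different distributions.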