# Data Exploration
## Introduction
This study presents a comprehensive exploratory data analysis of the Wine Quality datasets, which constitute the empirical foundation for the present investigation. The analysis encompasses both red and white wine variants from the Portuguese "Vinho Verde" collection, with the objective of elucidating the underlying patterns and physicochemical characteristics that determine wine quality assessments.

The analysis commences with the requisite library imports and the configuration of graphical parameters to ensure consistent, publication-standard visualizations throughout the investigative process.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

COLOR_RED = "#720026"
COLOR_WHITE = "#DBDD46"

The first dataset will be loaded and subjected to systematic examination of its constituent features, data types, and structural properties to establish a comprehensive understanding of the data architecture.

In [None]:
red_wine = pd.read_csv('../data/winequality-red.csv', sep=';')
red_wine.head()

The second dataset will then be examined using the same analytical approach.

In [None]:
white_wine = pd.read_csv('../data/winequality-white.csv', sep=';')
white_wine.head()

It's clear that we're working exclusively with *numerical* data (floats for the physicochemical features, plus an integer `quality` score), and both datasets share **identical feature sets**.
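Because `head()` displays values rather than data types, the two claims above can also be verified programmatically. The cell below is a minimal sketch of such a check: `compare_schemas` is a hypothetical helper, and `red_demo` and `white_demo` are small made-up stand-ins for `red_wine` and `white_wine`, not rows from the real files.

```python
import pandas as pd


def compare_schemas(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Report whether two DataFrames share column names and dtypes."""
    # Same names in the same order
    same_columns = list(left.columns) == list(right.columns)

    # Columns present in both frames whose dtypes differ
    shared = [col for col in left.columns if col in right.columns]
    dtype_mismatches = {
        col: (str(left[col].dtype), str(right[col].dtype))
        for col in shared
        if left[col].dtype != right[col].dtype
    }

    # Any column, in either frame, that is not numeric
    non_numeric = sorted(
        {
            col
            for frame in (left, right)
            for col in frame.columns
            if not pd.api.types.is_numeric_dtype(frame[col])
        }
    )

    return {
        "same_columns": same_columns,
        "dtype_mismatches": dtype_mismatches,
        "non_numeric": non_numeric,
    }


# Made-up illustrative values (assumption), standing in for the real datasets
red_demo = pd.DataFrame({"alcohol": [10.0, 11.0], "pH": [3.4, 3.3], "quality": [5, 6]})
white_demo = pd.DataFrame({"alcohol": [9.0, 12.0], "pH": [3.1, 3.2], "quality": [6, 7]})

schema_report = compare_schemas(red_demo, white_demo)
schema_report
```

In the notebook itself, the same call would be `compare_schemas(red_wine, white_wine)`; an empty `dtype_mismatches` and an empty `non_numeric` list would support both statements.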

We will now examine additional dataset characteristics, including sample size, and compare statistical measures such as mean, standard deviation, and median across both datasets.

In [None]:
red_wine.describe()

The same summary is then computed for the second dataset.

In [None]:
white_wine.describe()

The first observation is a substantial imbalance between the datasets: the red wine dataset comprises 1,599 samples while the white wine dataset contains 4,898, roughly three white samples for every red one, so any combined dataset is inherently unbalanced. Furthermore, most features show distinct statistical properties (mean, standard deviation, and median) across the two wine types, suggesting that wine color is strongly associated with physicochemical characteristics. Notably, alcohol content is the only feature with comparable statistics in both datasets.
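Reading two separate `describe()` outputs makes such comparisons error-prone, so it can help to place the statistics next to each other. The cell below is a sketch of one way to do this: `compare_summaries` and the `mean_rel_diff` column (the white mean minus the red mean, divided by the red mean) are illustrative choices rather than part of the original analysis, and the demo frames contain made-up values.

```python
import pandas as pd


def compare_summaries(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Place mean, std and median of two datasets side by side, one row per feature."""
    red_stats = red.describe().T      # features as rows, statistics as columns
    white_stats = white.describe().T

    table = pd.DataFrame(
        {
            "red_mean": red_stats["mean"],
            "white_mean": white_stats["mean"],
            "red_std": red_stats["std"],
            "white_std": white_stats["std"],
            "red_median": red_stats["50%"],   # describe() reports the median as "50%"
            "white_median": white_stats["50%"],
        }
    )
    # Relative gap between the means, expressed as a fraction of the red mean
    table["mean_rel_diff"] = (table["white_mean"] - table["red_mean"]) / table["red_mean"]
    return table


# Made-up illustrative values (assumption), not taken from the real datasets
red_demo = pd.DataFrame({"alcohol": [10.0, 11.0, 12.0], "residual sugar": [2.0, 2.0, 2.0]})
white_demo = pd.DataFrame({"alcohol": [10.0, 11.0, 12.0], "residual sugar": [4.0, 6.0, 8.0]})

comparison = compare_summaries(red_demo, white_demo)
comparison
```

Applied as `compare_summaries(red_wine, white_wine)`, features with a `mean_rel_diff` close to zero are the ones whose means are comparable across the two wine types.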

Let's now examine the datasets for any inconsistencies or data quality issues.

In [None]:
red_wine.isnull().sum()

In [None]:
white_wine.isnull().sum()

Fortunately, there are no missing values in either dataset.
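Missing values are only one kind of data quality issue. As a hedged extension of the check above, the following sketch summarises missing cells together with fully duplicated rows for several datasets at once. `quality_report` is a hypothetical helper, treating duplicates as a quality concern is an assumption that goes beyond the original check, and the demo frames are made up so that each issue appears once.

```python
import numpy as np
import pandas as pd


def quality_report(frames: dict) -> pd.DataFrame:
    """Count rows, missing cells and fully duplicated rows for each named DataFrame."""
    rows = {}
    for name, frame in frames.items():
        rows[name] = {
            "rows": len(frame),
            "missing_cells": int(frame.isnull().sum().sum()),
            # duplicated() flags every repeat of an earlier, identical row
            "duplicate_rows": int(frame.duplicated().sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up illustrative values (assumption): one duplicated row, one missing cell
red_demo = pd.DataFrame({"alcohol": [10.0, 10.0, 12.0], "quality": [5, 5, 6]})
white_demo = pd.DataFrame({"alcohol": [9.0, np.nan, 11.0], "quality": [6, 6, 7]})

report = quality_report({"red": red_demo, "white": white_demo})
report
```

In the notebook, the call would be `quality_report({"red": red_wine, "white": white_wine})`. Whether duplicated rows should be removed is a separate modelling decision, since identical measurements may come from distinct wines.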