In [5]:
import pandas as pd
df = pd.read_csv('wine-quality-white-and-red.csv')

### We use `describe()` to generate descriptive statistics of a DataFrame, such as:
Count: Number of non-null values.  
Mean: The average value.  
Std: Standard deviation, which measures how much the values vary around the mean.  
Min: Minimum value.  
25%: First quartile (25th percentile).  
50% (or Median): Second quartile or median (50th percentile).  
75%: Third quartile (75th percentile).  
Max: Maximum value.  
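
To make the list above concrete, here is a minimal sketch that compares each entry of `describe()` with the matching Series method. The series `s` and its values are made up for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# Made-up values for illustration; not taken from the wine dataset.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0], name="toy")

summary = s.describe()

# Each entry of describe() corresponds to a Series method.
manual = pd.Series({
    "count": s.count(),
    "mean": s.mean(),
    "std": s.std(),            # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.median(),
    "75%": s.quantile(0.75),
    "max": s.max(),
})

# Side-by-side comparison of the two results.
print(pd.DataFrame({"describe": summary, "manual": manual}))
```

Note that the single large value (10.0) pulls the mean above the median, which is the kind of skew these statistics help to spot.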

In [6]:
df.describe()


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 | 6497.0 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.7494 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.8 | 0.08 | 0.0 | 0.6 | 0.009 | 1.0 | 6.0 | 0.98711 | 2.72 | 0.22 | 8.0 | 3.0 |
| 25% | 6.4 | 0.23 | 0.25 | 1.8 | 0.038 | 17.0 | 77.0 | 0.99234 | 3.11 | 0.43 | 9.5 | 5.0 |
| 50% | 7.0 | 0.29 | 0.31 | 3.0 | 0.047 | 29.0 | 118.0 | 0.99489 | 3.21 | 0.51 | 10.3 | 6.0 |
| 75% | 7.7 | 0.4 | 0.39 | 8.1 | 0.065 | 41.0 | 156.0 | 0.99699 | 3.32 | 0.6 | 11.3 | 6.0 |
| max | 15.9 | 1.58 | 1.66 | 65.8 | 0.611 | 289.0 | 440.0 | 1.03898 | 4.01 | 2.0 | 14.9 | 9.0 |


### We use df.info() to get an overview of our DataFrame's structure and check for missing values

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   type                  6497 non-null   object 
 1   fixed acidity         6497 non-null   float64
 2   volatile acidity      6497 non-null   float64
 3   citric acid           6497 non-null   float64
 4   residual sugar        6497 non-null   float64
 5   chlorides             6497 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   total sulfur dioxide  6497 non-null   float64
 8   density               6497 non-null   float64
 9   pH                    6497 non-null   float64
 10  sulphates             6497 non-null   float64
 11  alcohol               6497 non-null   float64
 12  quality               6497 non-null   int64  
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB
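
In the output above, every non-null count equals the number of entries, which is how `info()` shows that nothing is missing. A more direct check is to count missing values per column with `isna().sum()`. The sketch below uses a hypothetical frame named `toy` with one deliberately missing value, since the real dataset has none.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap, to show what a missing value looks like.
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.1],
    "quality": [5, 6, 6],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the frame.
print("any missing:", bool(toy.isna().any().any()))
```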


#### The dataset has no missing values. The 'type' column has the object dtype and needs to be converted to a number. Since it has only two categories ('red' and 'white'), mapping them to 1 and 0 is straightforward. There is no need for LabelEncoder, as this is a binary transformation.

In [8]:
df['type'] = df['type'].map({'red': 1, 'white': 0})
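
One caveat with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`, so an unexpected label would silently introduce missing values. A minimal sketch of a sanity check follows, using made-up labels (including a hypothetical misspelling `"White"`) rather than the real column.

```python
import pandas as pd

# Hypothetical labels, including one unexpected spelling.
types = pd.Series(["red", "white", "White", "red"])

encoded = types.map({"red": 1, "white": 0})

# Series.map returns NaN for any value missing from the dictionary,
# so the NaN positions point at the labels that were not mapped.
unmapped = types[encoded.isna()].unique()
print(unmapped)
```

If `unmapped` is empty, every label was covered by the mapping and the encoded column is safe to use.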

We can now calculate the following for each column:
- Mean.
- Median.
- Mode.
- Standard deviation.
- Variance.
- Range of values.

In [11]:
mean_values = df.mean()
median_values = df.median()
# mode() can return several rows when values tie; iloc[0] keeps the first mode
mode_values = df.mode().iloc[0]
std_deviation = df.std()   # sample standard deviation (ddof=1)
variance = df.var()        # sample variance, the square of the standard deviation
range_values = df.max() - df.min()
# Collect everything in one table, with one row per column of df
statistics_summary = pd.DataFrame({
    'Mean': mean_values,
    'Median': median_values,
    'Mode': mode_values,
    'Standard Deviation': std_deviation,
    'Variance': variance,
    'Range': range_values
})
print(statistics_summary)

| | Mean | Median | Mode | Standard Deviation | Variance | Range |
|---|---|---|---|---|---|---|
| type | 0.246114 | 0.0 | 0.0 | 0.430779 | 0.18557 | 1.0 |
| fixed acidity | 7.215307 | 7.0 | 6.8 | 1.296434 | 1.68074 | 12.1 |
| volatile acidity | 0.339666 | 0.29 | 0.28 | 0.164636 | 0.027105 | 1.5 |
| citric acid | 0.318633 | 0.31 | 0.3 | 0.145318 | 0.021117 | 1.66 |
| residual sugar | 5.443235 | 3.0 | 2.0 | 4.757804 | 22.636696 | 65.2 |
| chlorides | 0.056034 | 0.047 | 0.044 | 0.035034 | 0.001227 | 0.602 |
| free sulfur dioxide | 30.525319 | 29.0 | 29.0 | 17.7494 | 315.041192 | 288.0 |
| total sulfur dioxide | 115.744574 | 118.0 | 111.0 | 56.521855 | 3194.720039 | 434.0 |
| density | 0.994697 | 0.99489 | 0.9972 | 0.002999 | 9e-06 | 0.05187 |
| pH | 3.218501 | 3.21 | 3.16 | 0.160787 | 0.025853 | 1.29 |
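
As an alternative to calling each method separately, most of these statistics can be requested in one call with `DataFrame.agg`. This is only a sketch on a made-up frame named `toy`; the column names echo the dataset, but the values are invented.

```python
import pandas as pd

# Made-up values for illustration; not taken from the wine dataset.
toy = pd.DataFrame({
    "pH": [3.0, 3.2, 3.2, 3.6],
    "alcohol": [9.0, 10.0, 11.0, 12.0],
})

# agg returns one row per statistic; transpose to get one row per column.
summary = toy.agg(["mean", "median", "std", "var"]).T

# The range has no built-in name in agg, so compute it separately.
summary["range"] = toy.max() - toy.min()

print(summary)
```

The `var` column is the square of the `std` column, which is a quick way to confirm that both use the same (sample) definition.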


#### Now we check which columns have the highest absolute correlation with the quality column

In [12]:
correlation_matrix = df.corr()
target_correlation = correlation_matrix['quality'].abs().sort_values(ascending=False)
# Display the results
print(target_correlation)


quality                 1.000000
alcohol                 0.444319
density                 0.305858
volatile acidity        0.265699
chlorides               0.200666
type                    0.119323
citric acid             0.085532
fixed acidity           0.076743
free sulfur dioxide     0.055463
total sulfur dioxide    0.041385
sulphates               0.038485
residual sugar          0.036980
pH                      0.019506
Name: quality, dtype: float64
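
Taking the absolute value is convenient for ranking, but it discards the sign, so the output above does not say whether a feature rises or falls with quality. A minimal sketch that keeps both the signed and the absolute correlation is shown below. The frame `toy` and its values are invented for illustration and do not reproduce the dataset's coefficients.

```python
import pandas as pd

# Made-up values for illustration; not taken from the wine dataset.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "density": [0.998, 0.997, 0.996, 0.994, 0.992],
    "quality": [5, 5, 6, 6, 7],
})

# Signed Pearson correlation of every feature with the target.
signed = toy.corr()["quality"].drop("quality")

# Rank by strength, but keep the sign next to it.
ranked = pd.DataFrame({
    "corr": signed,
    "abs_corr": signed.abs(),
}).sort_values("abs_corr", ascending=False)

print(ranked)
```

Applied to the real DataFrame, the same pattern would show the direction of each relationship next to its strength.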



The five features most strongly correlated with wine quality (by absolute Pearson correlation) are:

1. **Alcohol (0.444319):**
   - Wines with higher alcohol content tend to receive higher quality scores.

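2. **Density (0.305858):**
   - The ranking uses absolute values, so the direction is not visible above; in this dataset the underlying correlation is negative, meaning denser wines tend to receive lower quality scores.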

3. **Volatile Acidity (0.265699):**
   - Lower volatile acidity is associated with higher quality wines.

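4. **Chlorides (0.200666):**
   - Chloride (salt) content is weakly correlated with quality. In this dataset the underlying correlation is negative, so wines with more chlorides tend to score lower.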

5. **Type (0.119323):**
   - The type of wine (red or white) modestly affects wine quality.

These results are in line with industry expectations, underscoring the significance of alcohol content, density, volatile acidity, and chloride levels in influencing wine quality. While wine type plays a role, its impact is relatively modest compared to the other mentioned factors.