# Exploratory Data Analysis

In [22]:
# Load libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading and initial inspection
- Read the dataset (`pd.read_csv`, `sep=';'`, etc.)
- Check the format (`df.shape`, `df.info()`, `df.head()`)
- Check column names and types (`df.dtypes`)
- Count unique values per column if needed (`df.nunique()`)
> _Goal: know what we are working with._

In [23]:
# Load the data
df = pd.read_csv("../data/raw/winequality-red.csv", sep=';')

In [24]:
# Head
df.head()        # preview of the data

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [26]:
# Shape
df.shape

(1599, 12)

In [25]:
# Info
df.info()        # column types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [28]:
# Column names and types
df.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

In [29]:
# Count unique values
df.nunique()

fixed acidity            96
volatile acidity        143
citric acid              80
residual sugar           91
chlorides               153
free sulfur dioxide      60
total sulfur dioxide    144
density                 436
pH                       89
sulphates                96
alcohol                  65
quality                   6
dtype: int64

## Missing value analysis
- Count the NaN values (`df.isna().sum()`)
- Plot them if needed (`sns.heatmap(df.isna())`)
> _Goal: identify where the gaps are, without yet deciding how to fill them._

In [30]:
# Count NaN values
df.isna().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
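
When a dataset does contain gaps, the raw counts are easier to read next to the share of rows they represent. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` and a toy frame with one artificial gap (the wine dataset above has none):

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of NaN values per column, worst column first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(frame),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame: one pH value is deliberately replaced by NaN for illustration
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26, 3.16],
    "alcohol": [9.4, 9.8, 9.8, 9.8],
})
summary = missing_summary(toy)
print(summary)
```

On the wine data, `missing_summary(df)` would presumably show zeros everywhere, which matches the `df.isna().sum()` output above.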

## Duplicate analysis
- `df.duplicated().sum()`
- Inspect the duplicates if needed.
> _Goal: check whether any records are repeated._

In [31]:
# Check for duplicates
df.duplicated().sum()

np.int64(240)
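
The 240 flagged rows can be inspected before deciding what to do with them. Note that `df.duplicated()` uses `keep='first'` by default, so it counts only the later copies of each repeated row. The sketch below assumes a hypothetical helper `show_duplicates` and uses a subset of columns from the five rows printed by `df.head()`, where rows 0 and 4 are identical:

```python
import pandas as pd


def show_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Return every row that has at least one exact copy, grouped together."""
    mask = frame.duplicated(keep=False)  # True for ALL members of a duplicate group
    return frame[mask].sort_values(list(frame.columns))


# Four columns of the five rows shown by df.head()
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.7, 0.88, 0.76, 0.28, 0.7],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})
dupes = show_duplicates(sample)
n_extra_copies = int(sample.duplicated().sum())  # later copies only
print(dupes)
print(n_extra_copies)
```

Whether to drop these rows is a separate decision: identical measurements may be genuine repeated observations rather than data-entry errors.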

## Descriptive statistics
- `df.describe()`
- Mean, median, standard deviation, min, max, etc.
- Distribution of the target variable (here: `quality`)
> _Goal: spot trends, correlations, and outliers._

In [27]:
# Describe
df.describe()    # basic statistics

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |
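
Several columns have a maximum far above the third quartile (for example `residual sugar`: 75% at 2.6, max at 15.5), which hints at outliers. One way to quantify this is the interquartile-range rule. The sketch below assumes a hypothetical helper `iqr_outlier_counts` and the conventional multiplier `k = 1.5`; neither comes from the original notebook, and the toy values are illustrative only:

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)


# Toy data (illustrative values, not the wine dataset)
toy = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [1, 2, 3, 4, 5]})
counts = iqr_outlier_counts(toy)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the wine data would rank the columns by how many values fall outside these bounds. The rule only flags candidates; it does not say whether a value is wrong.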


## Exploratory visualizations
- Histograms, KDE plots, or box plots for the distributions (`sns.histplot`, `sns.boxplot`)
- Pair plots for pairwise relationships (`sns.pairplot`)
- Correlation heatmap (`sns.heatmap(df.corr())`)
> _Goal: spot trends, correlations, and outliers._

## Target analysis
- Distribution of the target class (`df['quality'].value_counts()`)
- Correlation between the target and the numeric features
> _Goal: understand what the model will have to learn and whether the classes are imbalanced._
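
Both checks can be sketched in a few lines. The helper name `target_report` is hypothetical, and the sample reuses four columns of the five rows printed by `df.head()`, so the numbers it produces say nothing about the full dataset:

```python
import pandas as pd


def target_report(frame: pd.DataFrame, target: str = "quality"):
    """Return class counts, class shares, and feature/target correlations."""
    counts = frame[target].value_counts().sort_index()
    shares = (counts / counts.sum()).round(3)
    # Pearson correlation with the target, strongest absolute value first
    corr = (
        frame.corr()[target]
        .drop(target)
        .sort_values(key=lambda s: s.abs(), ascending=False)
    )
    return counts, shares, corr


# Four columns of the five rows shown by df.head()
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "volatile acidity": [0.7, 0.88, 0.76, 0.28, 0.7],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})
counts, shares, corr = target_report(sample)
print(counts)
print(shares)
print(corr)
```

Since `quality` is an ordinal score with only 6 distinct values, a rank-based measure such as `frame.corr(method="spearman")` may be a reasonable alternative to the default Pearson coefficient.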