# EDA on Wine Quality Dataset

Dataset source: https://www.kaggle.com/yasserh/wine-quality-dataset

## Description:

This dataset covers red variants of the Portuguese "Vinho Verde" wine. It describes the amounts of various chemicals present in the wine and their effect on its quality. The dataset can be treated as a classification or a regression task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). The task is to predict the quality of a wine from the given data.

A simple yet challenging project: predicting the quality of wine.
The difficulty comes from the dataset being small and highly imbalanced.
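
One way to quantify the imbalance described above is to look at the share of samples per quality score. The following is a minimal sketch, not part of the original analysis: the helper name `class_distribution` is hypothetical, and the toy DataFrame reuses only the five `quality` values from the `head()` preview in this notebook, so the printed shares do not reflect the full dataset.

```python
import pandas as pd


def class_distribution(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Return the fraction of rows per class, sorted by class label."""
    return df[target].value_counts(normalize=True).sort_index()


# Toy sample (assumption): quality values of the five preview rows only.
sample = pd.DataFrame({"quality": [5, 5, 5, 6, 5]})
dist = class_distribution(sample)
print(dist)

# Ratio between the most and least frequent class; 1.0 would mean perfectly balanced.
imbalance_ratio = dist.max() / dist.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data, the same helper could be called as `class_distribution(wine_df)` once the CSV has been loaded.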

## Description of Dataset:

* `volatile acidity`: the gaseous acids present in wine; the amount of acetic acid, which at too high a level can lead to an unpleasant vinegar taste;
* `fixed acidity`: the primary fixed acids found in wine are tartaric, succinic, citric, and malic; most acids involved with wine are fixed (nonvolatile);
* `residual sugar`: the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet;
* `citric acid`: a weak organic acid found naturally in citrus fruits; in small quantities it can add ‘freshness’ and flavor to wines;
* `chlorides`: the amount of salt present in the wine;
* `free sulfur dioxide`: SO2 is used to protect wine against oxidation and microbial spoilage; the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion;
* `total sulfur dioxide`: the amount of free and bound forms of SO2; in low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste of wine;
* `pH`: used to check acidity; describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale;
* `density`: the density of wine is close to that of water, depending on the percent alcohol and sugar content;
* `sulphates`: a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant; added sulfites preserve freshness and protect wine from oxidation and bacteria;
* `alcohol`: the percent of alcohol present in the wine;
* `quality`: the output variable (based on sensory data, score between 0 and 10);
* `Id`.
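
The value ranges stated in the list above (pH between 0 and 14, quality between 0 and 10) can be turned into a quick sanity check. The sketch below is illustrative only: `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names, only the two ranges given in the descriptions are checked, and the toy data reuses the `pH` and `quality` values of the five preview rows in this notebook.

```python
import pandas as pd

# Ranges taken from the column descriptions; other columns are not checked.
EXPECTED_RANGES = {
    "pH": (0.0, 14.0),
    "quality": (0, 10),
}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> dict:
    """Count, per column, the values outside the inclusive [low, high] range.

    Missing values (NaN) are counted as out of range, because
    Series.between returns False for them.
    """
    counts = {}
    for column, (low, high) in ranges.items():
        if column in df.columns:
            counts[column] = int((~df[column].between(low, high)).sum())
    return counts


# Toy sample (assumption): pH and quality of the five preview rows only.
sample = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
    "quality": [5, 5, 5, 6, 5],
})
violations = out_of_range_counts(sample, EXPECTED_RANGES)
print(violations)
```

A non-zero count for any column would suggest a data-entry or parsing problem worth inspecting before modelling.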

## Personal motivation for picking this dataset:

I don't drink a lot, so when I do want to enjoy a glass of wine with my family or my friends, I want to be able to choose a good quality wine based on the wine's characteristics and type.


In [2]:
# Importing libraries and other necessary dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [4]:
# Load the dataset from the CSV file into a DataFrame
wine_df = pd.read_csv("data/wine_quality.csv")

In [6]:
# quick look at the dataset
wine_df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 1 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |


In [8]:
# check the shape of the DataFrame
wine_df.shape

(1143, 13)

In [15]:
# list the column names of the DataFrame
list(wine_df)

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality',
 'Id']

#### From the output above we learn that the Wine Quality dataset contains 1143 observations in total, each with 13 variables/attributes.

In [10]:
# Now, let's check the column names, non-null counts and data types of the dataset
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB
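
The same facts that `info()` prints (data types and non-null counts) can also be checked programmatically, which is handy for assertions in a script. This is a minimal sketch under stated assumptions: `summarize_columns` is a hypothetical helper, and the toy DataFrame contains only three of the columns, filled with values from the five preview rows in this notebook.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype name and number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


# Toy sample (assumption): three columns from the five preview rows only.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
    "Id": [0, 1, 2, 3, 4],
})
summary = summarize_columns(sample)
print(summary)

# True when every column holds a numeric dtype.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in sample.dtypes)
# True when at least one column has a missing value.
has_missing = bool(summary["missing"].any())
print(f"all numeric: {all_numeric}, has missing values: {has_missing}")
```

Applied to the full data, `summarize_columns(wine_df)` should agree with the `info()` output above.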


#### We can see that all 13 columns have numeric data types. Of the 13 variables, 11 are predictor variables, the 'quality' column is the response variable and the last c