                    Project: White Wine Data Exploratory Data Analysis (EDA) 


## White Wine Data
## Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is the process of analyzing the data, discovering patterns, spotting anomalies, testing hypotheses, and checking assumptions. We use summary statistics and graphical representations for the EDA.  <br><br>

Project Objective
<br><br>
Study the data set of white wine quality. The objective of the study is to conduct EDA on several parameters of the data set. 
<br>
Exploratory Data Analysis
<br>
    Understand the data using the pandas library
    
    1.    Print the first five rows of the dataset
    2.    Print the last five rows of the dataset
    3.    Find out the total number of rows and columns of the data set
    4.    Find out the columns, data types, and presence of null values or missing values in the data set

    Univariate Analysis

    5.    Find out missing values graphically
    6.    Find out the number of rows, mean, std deviation, min, Q1, Q2, Q3, max values for each variable. Document your observations to check if the outliers are present in the data set
    7.    Draw the histogram, kernel density estimate (kde) to check the distribution and skewness for each variable. Document your observation
    8.    Create a frequency distribution table and bar chart for the output variable
    9.    Draw box plot for each variable and identify the IQR and outliers

    Multivariate Analysis

    11.   Draw the correlation matrix and identify the variables that are correlated to each other.

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



import warnings
warnings.filterwarnings("ignore")



In [None]:
#import the file

imp = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
imp


In [None]:
imp.columns



## 1. Print the first five rows of the dataset

In [None]:
imp.head()

## 2. Print the last five rows of the dataset

In [None]:
imp.tail()

## 3. Find out the total number of rows and columns of the data set

In [None]:
print("Number of rows = ",len(imp))
print("Number of columns = ",len(imp.columns))

The cell above prints the total number of rows and the total number of columns.
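Both counts are also available in one step through the `shape` attribute. The sketch below is only an illustration: `demo` is a small made-up DataFrame standing in for `imp`, so the numbers are not those of the wine data.

```python
import pandas as pd

# Tiny made-up frame; in the notebook this would be `imp`.
demo = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 5, 6],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print("Number of rows =", n_rows)
print("Number of columns =", n_cols)
```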

## 4. Find out the columns, data types, and presence of null values or missing values in the data set

In [None]:
#Finding out columns 

imp.columns

In [None]:
#Finding out datatypes

imp.dtypes.to_frame(name='dtypes').T

In [None]:
#Finding out datatype

imp.dtypes

In [None]:
imp.info()

In [None]:
#finding out null or not null values

imp.isnull()

In [None]:
imp.isnull().sum()
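Raw counts can be hard to judge on their own, so it may help to show the share of missing cells next to them. This is a minimal sketch on a made-up frame (`demo` and its values are assumptions, not the wine data); `isnull().mean()` works because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few deliberately missing cells.
demo = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8, 10.0],
    "pH": [3.51, 3.20, np.nan, np.nan],
    "quality": [5, 5, 6, 6],
})

missing = pd.DataFrame({
    "missing_count": demo.isnull().sum(),
    "missing_pct": demo.isnull().mean() * 100,  # mean of True/False = fraction missing
})
print(missing)
print("Any missing values at all?", demo.isnull().values.any())
```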

## 5. Find out missing values graphically

In [None]:
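# Count the missing values in each column
nx_hist = imp.isnull().sum()

plt.figure(figsize=[10,5])

# One bar per column; the bar height is the number of missing values
nx_hist.plot(kind='bar')

plt.grid(axis='y', alpha=0.75)

plt.xlabel('Column',fontsize=15)

plt.ylabel('Missing values',fontsize=15)

plt.xticks(fontsize=15)

plt.yticks(fontsize=15)

plt.title('Missing Values per Column',fontsize=15)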

In [None]:
# The same counts can also be plotted more concisely:

hist_1 = imp.isnull().sum()
hist_1.plot()
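When a data set does contain gaps, a bar chart of the counts and a row-by-column map of the missing cells are two common ways to see them. The following is a rough sketch using plain matplotlib on a made-up frame (`demo` is an assumption; the `Agg` backend line is only there so the snippet runs outside a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up frame with a few deliberately missing cells.
demo = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8, 10.0],
    "pH": [3.51, 3.20, np.nan, np.nan],
    "quality": [5, 5, 6, 6],
})

fig, (ax_bar, ax_map) = plt.subplots(1, 2, figsize=(10, 4))

# Left: one bar per column, height = number of missing cells.
missing_counts = demo.isnull().sum()
missing_counts.plot(kind="bar", ax=ax_bar)
ax_bar.set_ylabel("Missing values")
ax_bar.set_title("Missing values per column")

# Right: row-by-column map, dark where a cell is missing.
ax_map.imshow(demo.isnull().to_numpy().astype(int), aspect="auto", cmap="gray_r")
ax_map.set_xticks(range(demo.shape[1]))
ax_map.set_xticklabels(demo.columns, rotation=90)
ax_map.set_ylabel("Row index")
ax_map.set_title("Missing-value map")

fig.tight_layout()
```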

## 6. Find out the number of rows, mean, std deviation, min, Q1, Q2, Q3, max values for each variable. Document your observations to check if the outliers are present in the data set

In [None]:
imp.describe()

In [None]:
round(imp.describe())
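One way to read outliers off the summary table alone is to compare each variable's `min` and `max` with the 1.5 × IQR fences built from its quartiles. The sketch below assumes a small made-up frame (`demo`) and is not the wine data; the column names added to the summary (`IQR`, `lower_fence`, `upper_fence`, `max_is_outlier`) are chosen here for illustration.

```python
import pandas as pd

# Made-up data: "alcohol" has one unusually large value, "pH" has none.
demo = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 10.5, 14.0],
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4],
})

summary = demo.describe().T  # transpose: one row per variable
summary["IQR"] = summary["75%"] - summary["25%"]
summary["lower_fence"] = summary["25%"] - 1.5 * summary["IQR"]
summary["upper_fence"] = summary["75%"] + 1.5 * summary["IQR"]
# A max above the upper fence means at least one high outlier exists.
summary["max_is_outlier"] = summary["max"] > summary["upper_fence"]

print(summary[["min", "25%", "50%", "75%", "max", "IQR", "max_is_outlier"]])
```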

In [None]:
f_d = imp.describe()

In [None]:
fa = f_d['fixed acidity']
print(fa)

print()

fa_q1 = fa.iloc[4]
fa_q3 = fa.iloc[6]

print("Fixed Acidity Q1 : ",fa_q1)
print("Fixed Acidity Q3 : ",fa_q3)

print()

fa_iqr = fa_q3 - fa_q1
print("Fixed Acidity IQR : ",fa_iqr)

print()

fa_out_l = fa_q1 - (1.5 * fa_iqr)
print("Fixed Acidity Lower Outliers : ",fa_out_l)

fa_out_u = fa_q3 + (1.5 * fa_iqr)
print("Fixed Acidity Upper Outliers : ",fa_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['fixed acidity']:
    if(i<fa_out_l or i>fa_out_u):
        print("\nOutliers Of Fixed Acidity",i)

In [None]:
va = f_d['volatile acidity']
print(va)

print()

va_q1 = va.iloc[4]
va_q3 = va.iloc[6]

print("Volatile Acidity Q1 : ",va_q1)
print("Volatile Acidity Q3 : ",va_q3)

print()

va_iqr = va_q3 - va_q1
print("Volatile Acidity IQR : ",va_iqr)

print()

va_out_l = va_q1 - (1.5 *va_iqr)
print("Volatile Acidity Lower Outliers : ",va_out_l)

va_out_u = va_q3 + (1.5 * va_iqr)
print("Volatile Acidity Upper Outliers : ",va_out_u)

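# Test the actual column values, not the describe() summary, against the bounds
for i in imp['volatile acidity']:
    if(i<va_out_l or i>va_out_u):
        print("\nOutliers Of Volatile Acidity",i)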

In [None]:
ca = f_d['citric acid']
print(ca)

print()

ca_q1 = ca.iloc[4]
ca_q3 = ca.iloc[6]

print("Citric Acid Q1 : ",ca_q1)
print("Citric Acid Q3 : ",ca_q3)

print()

ca_iqr = ca_q3 - ca_q1
print("Citric Acid IQR : ",ca_iqr)

print()

ca_out_l = ca_q1 - (1.5 * ca_iqr)
print("Citric Acid Lower Outliers : ",ca_out_l)

ca_out_u = ca_q3 + (1.5 * ca_iqr)
print("Citric Acid Upper Outliers : ",ca_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['citric acid']:
    if(i<ca_out_l or i>ca_out_u):
        print("\nOutliers Of Citric Acid",i)

In [None]:
rs = f_d['residual sugar']
print(rs)

print()

rs_q1 = rs.iloc[4]
rs_q3 = rs.iloc[6]

print("Residual Sugar Q1 : ",rs_q1)
print("Residual Sugar Q3 : ",rs_q3)

print()

rs_iqr = rs_q3 - rs_q1
print("Residual Sugar IQR : ",rs_iqr)

print()

rs_out_l = rs_q1 - (1.5 * rs_iqr)
print("Residual Sugar Lower Outliers : ",rs_out_l)

rs_out_u = rs_q3 + (1.5 * rs_iqr)
print("Residual Sugar Upper Outliers : ",rs_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['residual sugar']:
    if(i<rs_out_l or i>rs_out_u):
        print("\nOutliers Of Residual Sugar",i)

In [None]:
c = f_d['chlorides']
print(c)

print()

c_q1 = c.iloc[4]
c_q3 = c.iloc[6]

print("Chlorides Q1 : ",c_q1)
print("Chlorides Q3 : ",c_q3)

print()

c_iqr = c_q3 - c_q1
print("Chlorides IQR : ",c_iqr)

print()

c_out_l = c_q1 - (1.5 * c_iqr)
print("Chlorides Lower Outliers : ",c_out_l)

c_out_u = c_q3 + (1.5 * c_iqr)
print("Chlorides Upper Outliers : ",c_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['chlorides']:
    if(i<c_out_l or i>c_out_u):
        print("\nOutliers Of Chlorides",i)

In [None]:
fsd = f_d['free sulfur dioxide']
print(fsd)

print()

fsd_q1 = fsd.iloc[4]
fsd_q3 = fsd.iloc[6]

print("Free Sulfur Dioxide Q1 : ",fsd_q1)
print("Free Sulfur Dioxide Q3 : ",fsd_q3)

print()

fsd_iqr = fsd_q3 - fsd_q1
print("Free Sulfur Dioxide IQR : ",fsd_iqr)

print()

fsd_out_l = fsd_q1 - (1.5 * fsd_iqr)
print("Free Sulfur Dioxide Lower Outliers : ",fsd_out_l)

fsd_out_u = fsd_q3 + (1.5 * fsd_iqr)
print("Free Sulfur Dioxide Upper Outliers : ",fsd_out_u)

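# Test the actual column values, not the describe() summary, against the bounds
for i in imp['free sulfur dioxide']:
    if(i<fsd_out_l or i>fsd_out_u):
        print("\nOutliers Of Free Sulfur Dioxide",i)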

In [None]:
tsd = f_d['total sulfur dioxide']
print(tsd)

print()

tsd_q1 = tsd.iloc[4]
tsd_q3 = tsd.iloc[6]

print("Total Sulfur Dioxide Q1 : ",tsd_q1)
print("Total Sulfur Dioxide Q3 : ",tsd_q3)

print()

tsd_iqr = tsd_q3 - tsd_q1
print("Total Sulfur Dioxide IQR : ",tsd_iqr)

print()

tsd_out_l = tsd_q1 - (1.5 * tsd_iqr)
print("Total Sulfur Dioxide Lower Outliers : ",tsd_out_l)

tsd_out_u = tsd_q3 + (1.5 * tsd_iqr)
print("Total Sulfur Dioxide Upper Outliers : ",tsd_out_u)

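# Test the actual column values, not the describe() summary, against the bounds
for i in imp['total sulfur dioxide']:
    if(i<tsd_out_l or i>tsd_out_u):
        print("\nOutliers Of Total Sulfur Dioxide",i)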

In [None]:
d = f_d['density']
print(d)

print()

d_q1 = d.iloc[4]
d_q3 = d.iloc[6]

print("Density Q1 : ",d_q1)
print("Density Q3 : ",d_q3)

print()

d_iqr = d_q3 - d_q1
print("Density IQR : ",d_iqr)

print()

d_out_l = d_q1 - (1.5 * d_iqr)
print("Density Lower Outliers : ",d_out_l)

d_out_u = d_q3 + (1.5 * d_iqr)
print("Density Upper Outliers : ",d_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['density']:
    if(i<d_out_l or i>d_out_u):
        print("\nOutliers Of Density",i)

In [None]:
ph = f_d['pH']
print(ph)

print()

ph_q1 = ph.iloc[4]
ph_q3 = ph.iloc[6]

print("PH Q1 : ",ph_q1)
print("PH Q3 : ",ph_q3)

print()

ph_iqr = ph_q3 - ph_q1
print("PH IQR : ",ph_iqr)

print()

ph_out_l = ph_q1 - (1.5 * ph_iqr)
print("PH Lower Outliers : ",ph_out_l)

ph_out_u = ph_q3 + (1.5 * ph_iqr)
print("PH Upper Outliers : ",ph_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['pH']:
    if(i<ph_out_l or i>ph_out_u):
        print("\nOutliers Of PH",i)

In [None]:
s = f_d['sulphates']
print(s)

print()

s_q1 = s.iloc[4]
s_q3 = s.iloc[6]

print("Sulphates Q1 : ",s_q1)
print("Sulphates Q3 : ",s_q3)

print()

s_iqr = s_q3 - s_q1
print("Sulphates IQR : ",s_iqr)

print()

s_out_l = s_q1 - (1.5 * s_iqr)
print("Sulphates Lower Outliers : ",s_out_l)

s_out_u = s_q3 + (1.5 * s_iqr)
print("Sulphates Upper Outliers : ",s_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['sulphates']:
    if(i<s_out_l or i>s_out_u):
        print("\nOutliers Of Sulphates",i)

In [None]:
a = f_d['alcohol']
print(a)

print()

a_q1 = a.iloc[4]
a_q3 = a.iloc[6]

print("Alcohol Q1 : ",a_q1)
print("Alcohol Q3 : ",a_q3)

print()

a_iqr = a_q3 - a_q1
print("Alcohol IQR : ",a_iqr)

print()

a_out_l = a_q1 - (1.5 * a_iqr)
print("Alcohol Lower Outliers : ",a_out_l)

a_out_u = a_q3 + (1.5 * a_iqr)
print("Alcohol Upper Outliers : ",a_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['alcohol']:
    if(i<a_out_l or i>a_out_u):
        print("\nOutliers Of Alcohol",i)

In [None]:
q = f_d['quality']
print(q)

print()

q_q1 = q.iloc[4]
q_q3 = q.iloc[6]

print("Quality Q1 : ",q_q1)
print("Quality Q3 : ",q_q3)

print()

q_iqr = q_q3 - q_q1
print("Quality IQR : ",q_iqr)

print()

q_out_l = q_q1 - (1.5 * q_iqr)
print("Quality Lower Outliers : ",q_out_l)

q_out_u = q_q3 + (1.5 * q_iqr)
print("Quality Upper Outliers : ",q_out_u)

# Test the actual column values, not the describe() summary, against the bounds
for i in imp['quality']:
    if(i<q_out_l or i>q_out_u):
        print("\nOutliers Of Quality",i)
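The per-variable cells above all repeat the same steps (Q1, Q3, IQR, lower and upper bounds, then a scan for values outside them), so they could be folded into one helper applied to every column. This is a sketch under stated assumptions: `iqr_outliers` is a hypothetical helper name, and `demo` is a made-up frame rather than the wine data.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Made-up data: "alcohol" has one unusually large value, "pH" has none.
demo = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 10.5, 14.0],
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4],
})

for column in demo.columns:
    outliers = iqr_outliers(demo[column])
    print(f"{column}: {len(outliers)} outlier(s) -> {outliers.tolist()}")
```

In the notebook, the same loop over `imp.columns` would cover all twelve variables in a single cell.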

## 7. Draw the histogram, kernel density estimate (kde) to check the distribution and skewness for each variable. Document your observation

## Histogram

## Fixed Acidity

In [None]:
plt.hist(imp['fixed acidity'],bins=10)

## Volatile Acidity

In [None]:
plt.hist(imp['volatile acidity'],bins=10)

## Citric Acid

In [None]:
plt.hist(imp['citric acid'],bins=10)

## Residual Sugar


In [None]:

plt.hist(imp['residual sugar'],bins=10)

## Chlorides

In [None]:
plt.hist(imp['chlorides'],bins=10)

## Free Sulfur Dioxide


In [None]:
plt.hist(imp['free sulfur dioxide'],bins=10)

## Total Sulfur Dioxide


In [None]:
plt.hist(imp['total sulfur dioxide'],bins=10)

## Density


In [None]:
plt.hist(imp['density'],bins=20)

## pH


In [None]:
plt.hist(imp['pH'],bins=20)

## Sulphates


In [None]:
plt.hist(imp['sulphates'],bins=20)

## Alcohol


In [None]:
plt.hist(imp['alcohol'],bins=20)

## Quality


In [None]:
plt.hist(imp['quality'],bins=7)
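As an alternative to one cell per variable, pandas can draw the histograms of every numeric column on a single grid with `DataFrame.hist`. The sketch below uses randomly generated stand-in data (`demo`, with distribution parameters picked arbitrarily), not the wine data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; the parameters are arbitrary.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "pH": rng.normal(3.3, 0.15, 200),
    "residual sugar": rng.exponential(2.5, 200),
    "quality": rng.integers(3, 9, 200),
})

# One subplot per numeric column; each subplot is titled with the column name.
axes = demo.hist(bins=20, figsize=(10, 6))
plt.tight_layout()
```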

## Kernel Density Estimate



## Fixed Acidity


In [None]:
imp['fixed acidity'].plot.kde()

## Volatile Acidity


In [None]:
imp['volatile acidity'].plot.kde()

## Citric Acid


In [None]:
imp['citric acid'].plot.kde()

## Residual Sugar


In [None]:
imp['residual sugar'].plot.kde()

## Chlorides


In [None]:
imp['chlorides'].plot.kde()

## Free Sulfur Dioxide


In [None]:
imp['free sulfur dioxide'].plot.kde()

## Total Sulfur Dioxide


In [None]:
imp['total sulfur dioxide'].plot.kde()

## Density


In [None]:
imp['density'].plot.kde()

## pH


In [None]:
imp['pH'].plot.kde()

## Sulphates


In [None]:
imp['sulphates'].plot.kde()

## Alcohol


In [None]:
imp['alcohol'].plot.kde()

## Quality


In [None]:
imp['quality'].plot.kde()

## Skewness Of Each Variable


In [None]:
imp.skew()

Since none of the variables has a negative skewness value, we can conclude that all of them are right-skewed (positively skewed), although a value close to zero indicates a nearly symmetric distribution.
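The sign of the skewness gives the direction of the tail, while its size says how pronounced the asymmetry is. The sketch below labels each column accordingly; `describe_skew` is a hypothetical helper, the 0.5 cut-off is a rule of thumb assumed here rather than a fixed standard, and `demo` is made-up data.

```python
import pandas as pd

def describe_skew(value, tol=0.5):
    """Label a skewness value; `tol` is an assumed cut-off for 'roughly symmetric'."""
    if value > tol:
        return "right-skewed"
    if value < -tol:
        return "left-skewed"
    return "roughly symmetric"

# Made-up columns with a long right tail, a long left tail, and no tail.
demo = pd.DataFrame({
    "long_right_tail": [1, 1, 2, 2, 3, 3, 4, 20],
    "long_left_tail": [-20, -4, -3, -3, -2, -2, -1, -1],
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8],
})

skew = demo.skew()
print(pd.DataFrame({"skew": skew, "shape": skew.map(describe_skew)}))
```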

## 8. Create a Frequency Distribution Table And Bar Chart For The Output Variable


In [None]:

# countplot draws one bar per distinct value, with the bar height equal to its frequency
sns.countplot(x=imp['fixed acidity'])
pd.crosstab(index=imp['fixed acidity'],columns='total')

In [None]:
sns.countplot(x=imp['volatile acidity'])

pd.crosstab(index=imp['volatile acidity'],columns='total')

In [None]:
sns.countplot(x=imp['citric acid'])

pd.crosstab(index= imp['citric acid'],columns='total')

In [None]:
sns.countplot(x=imp['residual sugar'])

pd.crosstab(index=imp['residual sugar'],columns='total')

In [None]:
sns.countplot(x=imp['chlorides'])

pd.crosstab(index=imp['chlorides'],columns='total')

In [None]:
sns.countplot(x=imp['free sulfur dioxide'])

pd.crosstab(index=imp['free sulfur dioxide'],columns='total')

In [None]:
sns.countplot(x=imp['density'])

pd.crosstab(index=imp['density'],columns='total')

In [None]:
sns.countplot(x=imp['pH'])

pd.crosstab(index=imp['pH'],columns='total')

In [None]:
sns.countplot(x=imp['sulphates'])

pd.crosstab(index=imp['sulphates'],columns='total')

In [None]:
sns.countplot(x=imp['alcohol'])

pd.crosstab(index=imp['alcohol'],columns='total')

In [None]:
# quality is the output variable
sns.countplot(x=imp['quality'])

pd.crosstab(index=imp['quality'],columns='total')
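For the output variable, the frequency table can also carry a percentage column, and the bar chart can be drawn directly from the same counts. This is a minimal sketch with pandas and matplotlib only; the `quality` values below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import pandas as pd

# Made-up quality scores standing in for imp['quality'].
quality = pd.Series([5, 5, 6, 6, 6, 7, 5, 4, 6, 7], name="quality")

# Frequency of each score, ordered by score rather than by count.
freq = quality.value_counts().sort_index()
table = pd.DataFrame({
    "count": freq,
    "percent": (freq / freq.sum() * 100).round(1),
})
print(table)

ax = freq.plot(kind="bar")
ax.set_xlabel("quality")
ax.set_ylabel("Number of wines")
ax.set_title("Frequency of each quality score")
plt.tight_layout()
```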

## 9. Draw box plot for each variable and identify the IQR and outliers

In [None]:
d = imp.describe()

## Fixed Acidity

In [None]:
sns.set(style="whitegrid")
sns.boxplot(x=imp['fixed acidity'],data=imp)

## Volatile Acidity

In [None]:
sns.boxplot(x=imp['volatile acidity'],data=imp)

## Citric Acid


In [None]:
sns.boxplot(x=imp['citric acid'],data=imp)

## Residual Sugar

In [None]:
sns.boxplot(x=imp['residual sugar'],data=imp)

## Chlorides

In [None]:
sns.boxplot(x=imp['chlorides'],data=imp)

## Free Sulfur Dioxide


In [None]:
sns.boxplot(x=imp['free sulfur dioxide'],data=imp)

## Total Sulfur Dioxide

In [None]:
sns.boxplot(x=imp['total sulfur dioxide'],data=imp)

## Density


In [None]:
sns.boxplot(x=imp['density'],data=imp)

## pH


In [None]:
sns.boxplot(x=imp['pH'],data=imp)

## Sulphates


In [None]:
sns.boxplot(x=imp['sulphates'],data=imp)

## Alcohol


In [None]:
sns.boxplot(x=imp['alcohol'],data=imp)

## Quality


In [None]:
sns.boxplot(x=imp['quality'],data=imp)
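In a box plot the box spans Q1 to Q3 (so its length is the IQR), and points drawn beyond the whiskers are the outliers. The sketch below draws one box per variable with plain matplotlib and reads the outlier points back from the plot; `demo` is made-up data, and the `fliers` dictionary is a name introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: only "alcohol" contains an unusually large value.
demo = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 10.5, 14.0],
    "pH": [3.0, 3.1, 3.2, 3.3, 3.4],
    "sulphates": [0.5, 0.55, 0.6, 0.65, 0.7],
})

fig, axes = plt.subplots(1, len(demo.columns), figsize=(9, 3))
fliers = {}
for ax, column in zip(axes, demo.columns):
    # whis=1.5: whiskers reach at most 1.5 * IQR beyond the box edges.
    result = ax.boxplot(demo[column].to_numpy(), whis=1.5)
    ax.set_title(column)
    # Points plotted beyond the whiskers are the outliers ("fliers").
    fliers[column] = result["fliers"][0].get_ydata().tolist()

fig.tight_layout()
print(fliers)
```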

## Multivariate Analysis
## 11. Draw the correlation matrix and identify the variables that are correlated to each other.

In [None]:
cr=imp.corr()
cr
cr.style.background_gradient(cmap='coolwarm')

In [None]:
plt.figure(figsize=[20,10])

sns.heatmap(imp.corr(), annot=True, cmap='coolwarm')
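To identify the correlated variables without reading every cell of the heatmap, the matrix can be flattened into a list of variable pairs sorted by the absolute correlation. The sketch below is an illustration on synthetic data: the columns `a` to `d` are made up, and the 0.5 threshold is an arbitrary assumption for what counts as "correlated".

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic data: b and c are built from a, d is independent noise.
rng = np.random.default_rng(42)
x = rng.normal(size=300)
demo = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=300),  # strongly, positively tied to a
    "c": -x + rng.normal(scale=0.1, size=300),     # strongly, negatively tied to a
    "d": rng.normal(size=300),                     # unrelated
})

corr = demo.corr()

# combinations() yields each unordered pair once and skips the diagonal.
pairs = pd.DataFrame(
    [(v1, v2, corr.loc[v1, v2]) for v1, v2 in combinations(corr.columns, 2)],
    columns=["var_1", "var_2", "r"],
)
pairs = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)

threshold = 0.5  # assumed cut-off; adjust as needed
strong = pairs[pairs["r"].abs() >= threshold]
print(strong)
```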

In [None]:
sns.pairplot(imp)

In [None]:
sns.relplot(y='fixed acidity' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='volatile acidity' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='citric acid' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='residual sugar' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='chlorides' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='free sulfur dioxide' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='total sulfur dioxide' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='density' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='pH' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='sulphates' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='alcohol' ,x='density', col='quality', data=imp)

In [None]:
sns.relplot(y='quality' ,x='density', col='quality', data=imp)

                                                    Thanks 

                                              Yash Kumar Sharma