## EDA With Red Wine Data

Data Set Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.


Attribute Information:

Input variables (based on physicochemical tests):
- 1 - fixed acidity
- 2 - volatile acidity
- 3 - citric acid
- 4 - residual sugar
- 5 - chlorides
- 6 - free sulfur dioxide
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH
- 10 - sulphates
- 11 - alcohol

Output variable (based on sensory data):
- 12 - quality (score between 0 and 10)

In [None]:
# EDA is all about understanding the given dataset

In [None]:
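import pandas as pd

# Load the red wine data. The original UCI file is semicolon-delimited,
# so pass sep=';' if every column ends up in a single field.
df = pd.read_csv('winequality-red.csv')
df.head()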

In [None]:
## Summary of the dataset
df.info()

In [None]:
## descriptive summary of the dataset
df.describe()

In [None]:
## Number of rows and columns
df.shape

In [None]:
## List all column names
df.columns

In [None]:
## Distinct quality scores present in the data
df['quality'].unique()

In [None]:
## Check for missing values: count of nulls in each column
df.isnull().sum()

In [None]:
## Check for duplicate records
# df.duplicated() returns a boolean Series in which True marks a row that
# matches an earlier row exactly, in every column.
# df[df.duplicated()] keeps only those repeated rows; the first occurrence is not flagged.
df[df.duplicated()]
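To make the behaviour of `duplicated()` concrete, here is a minimal sketch on a made-up three-row frame. The values are illustrative and not taken from the wine data. It also shows the `keep` parameter, which controls which copy is treated as the original.

```python
import pandas as pd

# Toy frame with hypothetical values (not from the wine dataset).
# Rows 0 and 2 are identical in every column.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.4],
    "pH": [3.51, 3.20, 3.51],
})

# Default keep='first': only later copies of a row are flagged.
first_mask = toy.duplicated()

# keep=False flags every row that has an identical twin.
all_mask = toy.duplicated(keep=False)

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()

print(first_mask.tolist())
print(all_mask.tolist())
print(deduped.shape)
```

Using `keep=False` can be handy when you want to inspect all copies side by side before deciding whether to drop them.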

In [9]:
## Remove the duplicates
df.drop_duplicates(inplace=True)

In [None]:
## Shape after removing duplicates
df.shape

In [None]:
## Pairwise Pearson correlation between all columns
df.corr()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))  # figure size: 10 inches wide, 6 inches tall
sns.heatmap(df.corr(), annot=True)  # annot=True prints each correlation value inside its cell
plt.show()
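A heatmap shows every pair of columns at once. When the question is which inputs move most with `quality`, it may help to pull out that single column of the correlation matrix and sort it. The sketch below assumes a numeric frame with a `quality` column. The helper name `rank_by_target_correlation` and the toy values are hypothetical.

```python
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return the other columns ordered by absolute Pearson correlation with `target`."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy data with made-up values (not from the wine dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "volatile acidity": [0.8, 0.6, 0.5, 0.3],
    "pH": [3.3, 3.1, 3.4, 3.2],
    "quality": [4, 5, 6, 7],
})

ranked = rank_by_target_correlation(toy, "quality")
print(ranked)
```

On the real data the call would be `rank_by_target_correlation(df, "quality")`. Sorting by absolute value keeps strong negative correlations near the top instead of at the bottom.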

In [None]:
## Visualization: number of wines per quality score
# Conclusion: the dataset is imbalanced; mid-range scores far outnumber the lowest and highest ones.
# value_counts() counts how often each quality score occurs;
# .plot(kind='bar') draws those counts as a bar chart.
df.quality.value_counts().plot(kind='bar')
plt.xlabel("Wine Quality")
plt.ylabel("Count")
plt.show()
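The imbalance can also be put into numbers. This is a small sketch on a made-up series of quality scores (not the real counts). It uses `value_counts(normalize=True)` for class shares and `sort_index()` so the scores appear in order rather than by frequency. The `imbalance_ratio` name is just an illustrative choice.

```python
import pandas as pd

# Hypothetical quality scores, for illustration only.
toy_quality = pd.Series([5, 5, 5, 6, 6, 7, 5, 6, 4, 5], name="quality")

# Absolute counts, ordered by quality score instead of by frequency.
counts = toy_quality.value_counts().sort_index()

# Share of each score as a fraction of all rows.
shares = toy_quality.value_counts(normalize=True).sort_index()

# Ratio of the most common class to the rarest one.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

Applied to the notebook's data, the same calls would run on `df['quality']`. Adding `.sort_index()` before `.plot(kind='bar')` would order the bars by score.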

In [None]:
df.head()

In [None]:
# Plot a histogram with a kernel density estimate (KDE) for each column,
# a quick way to explore the distribution of every feature during EDA.
# A new figure per column keeps the histograms from being drawn on top of each other.
for column in df.columns:
    plt.figure()
    sns.histplot(df[column], kde=True)
    plt.title(column)
    plt.show()
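As a numeric companion to the histograms, skewness summarises how lopsided each distribution is. The sketch below uses pandas' `DataFrame.skew()` on made-up columns. The column names and values are hypothetical, chosen so that one column is symmetric and the other has a long right tail.

```python
import pandas as pd

# Toy columns with made-up values (not from the wine dataset).
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tailed": [1.0, 1.0, 1.0, 2.0, 10.0],
})

# Sample skewness per column: about 0 for symmetric data,
# positive when the tail extends to the right.
skewness = toy.skew().sort_values(ascending=False)

print(skewness)
```

On the wine data, `df.skew()` would give one value per feature. Columns with large positive values are the ones whose histograms show a long right tail.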

In [None]:
sns.histplot(df['alcohol'])
plt.title('Distribution of Alcohol Content in Red Wine')
plt.xlabel('Alcohol %')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Pair plot: univariate distributions on the diagonal, pairwise (bivariate) scatter plots elsewhere
sns.pairplot(df)

In [None]:
## Categorical plot: alcohol content by quality score
# X-axis (quality): the wine quality ratings (e.g., 3, 4, 5, 6, 7, 8).
# Y-axis (alcohol): the alcohol percentage of the wines.
# Reading the box plot:
# - The box spans the interquartile range (IQR), the middle 50% of alcohol values.
# - The line inside the box is the median alcohol content for that rating.
# - The whiskers reach the most extreme values within 1.5 * IQR of the box edges (the default).
# - Points beyond the whiskers are outliers: unusually high or low alcohol for that rating.
sns.catplot(x='quality', y='alcohol', data=df, kind="box")
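The outlier rule behind the box plot can be reproduced by hand. The sketch below applies the conventional 1.5 × IQR fences (seaborn's default `whis=1.5`) within each quality group. The helper `iqr_outliers` and the toy values are hypothetical, not part of the original notebook.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values that fall outside the box-plot fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Toy data with made-up values (not from the wine dataset).
toy = pd.DataFrame({
    "quality": [5, 5, 5, 5, 5, 6, 6, 6, 6, 6],
    "alcohol": [9.2, 9.4, 9.5, 9.6, 14.0, 10.0, 10.2, 10.4, 10.6, 10.8],
})

# Apply the rule separately within each quality rating, as the box plot does.
for quality, group in toy.groupby("quality"):
    flagged = iqr_outliers(group["alcohol"])
    print(quality, flagged.tolist())
```

In the toy data only the 14.0 value in the quality-5 group falls outside its fences. On the real data the same loop would run over `df.groupby("quality")`.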

In [None]:
df.head()

In [None]:
# Alcohol vs. pH, with points colored by quality score
sns.scatterplot(x='alcohol', y='pH', hue='quality', data=df)