In [None]:
# importing important packages
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style='darkgrid')
sns.set_palette(palette='pastel')
# set option to display all columns without collapse in notebook
pd.set_option('display.max_columns', None)
# setting column width to display all the content
pd.options.display.max_colwidth = 300
cmap = sns.diverging_palette(220, 10, as_cmap=True)

In [None]:
# reading the dataset
data = pd.read_csv("/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")
# let's look at the first four records
data.head(4)

In [None]:
# plotting the frequency of each category of the dependent feature as a bar plot

plt.figure(figsize=(15,6))
# name the columns explicitly so the code does not depend on how the installed
# pandas version labels the output of value_counts()
quality_count = data['quality'].value_counts().rename_axis('Quality').reset_index(name='Count')
ax = sns.barplot(x='Quality', y='Count', data=quality_count, palette="ch:.25")
for p in ax.patches:
    # counts are whole numbers, so print them without decimals
    ax.annotate(format(p.get_height(), '.0f'),
               (p.get_x() + p.get_width() / 2., p.get_height()),
               ha='center', va='center',
               xytext = (0,9),
               textcoords='offset points')
plt.xlabel("Quality", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.title("Bar plot of Quality", fontsize=20)
plt.show()

- Quality 5 has a higher count than any other score.
- We can say that most of the wines are of average quality.
- Wines with quality == 5 may really differ from the others; it is possible.
- We will check this hypothesis later in this notebook.

In [None]:
data.info()

From the dataframe info:
- The dataframe has 1599 records
- The dataframe has 11 independent features and 1 dependent feature
- There are no missing values
- There are no categorical or datetime features; all features are numerical

In [None]:
table_nan = data.isna().sum().to_frame().style.background_gradient(cmap=cmap)
print("Missing values per feature")
table_nan


In [None]:
feature_desc = ['most acids involved with wine are fixed or nonvolatile (do not evaporate readily)',
                'the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste',
                'found in small quantities, citric acid can add freshness and flavor to wines',
                "the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet",
                'the amount of salt in the wine',
                'the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine',
                "amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine",
                "the density of wine is close to that of water depending on the percent alcohol and sugar content",
                "describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale",
                "a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant",
                "-",
                "score between 0 and 10"]
feature_desc = pd.DataFrame(feature_desc,
                            columns=['Description'],
                            index=data.columns)

In [None]:
data_desc = data.describe().T
data_description = pd.concat([feature_desc, data_desc], axis=1)

In [None]:
data_description

This dataframe combines the meaning of each feature with some basic statistics. Let's try to draw a few findings from it, just for practice.

- The mean and median of the density feature are almost the same. It may be normally distributed, but we cannot conclude that from the summary alone; we will check this hypothesis later in this notebook.
- In contrast, the mean and median of chlorides are not close to each other, so the data may be affected by outliers.



## Descriptive Statistics



- In my opinion, relying on the mean alone is not a good approach, because the mean is easily affected by outliers.
- If the data is highly skewed, we do not choose the mean; here we will go with the median.
- A box plot is a good way to show the median and the outliers together.
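
To make the first two points concrete, here is a minimal sketch in plain Python. The readings are made up for illustration (they are not taken from the wine dataset), and the last one plays the role of an outlier.

```python
import statistics

# Hypothetical chloride-like readings; the last value is an artificial outlier.
values = [0.07, 0.08, 0.08, 0.09, 0.09, 0.10, 0.61]

mean_all = statistics.mean(values)
median_all = statistics.median(values)

# Drop the outlier and recompute both summaries.
trimmed = values[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with outlier:    mean={mean_all:.3f}, median={median_all:.3f}")
print(f"without outlier: mean={mean_trimmed:.3f}, median={median_trimmed:.3f}")
```

A single extreme value moves the mean a lot and the median only slightly, which is why the median is the safer summary for skewed features.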

In [None]:
f, axes = plt.subplots(3, 4, figsize=(25,15))
sns.despine(left=True)

# one horizontal box plot per feature, filling the 3x4 grid row by row
columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
           'chlorides', 'density', 'pH', 'sulphates',
           'alcohol', 'total sulfur dioxide', 'free sulfur dioxide', 'quality']
for col, ax in zip(columns, axes.flat):
    # pass the series as x explicitly: recent seaborn versions treat the
    # first positional argument as `data`, not `x`
    sns.boxplot(x=data[col], ax=ax)

plt.show()

We can clearly see that many features have outliers.
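
The points a box plot draws beyond its whiskers are, with the default settings, the values outside 1.5 times the interquartile range from the quartiles. The sketch below applies that rule by hand; `iqr_outliers` is a hypothetical helper name and the sample values are invented.

```python
import statistics

def iqr_outliers(values):
    """Return the points that fall outside the 1.5 * IQR fences."""
    # method="inclusive" interpolates between data points, like the default
    # quantile method in NumPy and pandas.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical residual-sugar-like values with two large readings.
sample = [1.9, 2.0, 2.1, 2.2, 2.2, 2.3, 2.5, 2.6, 8.1, 15.5]
print(iqr_outliers(sample))
```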

Let's take a closer look at density and pH.

We will compare three common and very useful measures of central tendency: the mean, the median, and the mode.

In [None]:
density_mean = np.mean(data['density'])
density_median = np.median(data['density'])
density_mode = data['density'].mode()[0]

q1 = data['density'].quantile(0.25)
q3 = data['density'].quantile(0.75)

density_IQR = q3 - q1

print("Mean : {}".format(density_mean))
print("Median : {}".format(density_median))
print("Mode : {}".format(density_mode))
print("Interquartile Range : {}".format(density_IQR))

The mean and median of the density feature are almost equal. Could density be normally distributed? Let's see.

In [None]:
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw= {'height_ratios':(0.2, 1)})

# pass the series as x so the box is drawn horizontally on the shared x-axis
sns.boxplot(x=data["density"], ax=ax_box)
ax_box.axvline(density_mean, color='r', linestyle='--')
ax_box.axvline(density_median, color='g', linestyle='-')
ax_box.axvline(density_mode, color='b', linestyle='-')

# note: distplot is deprecated in recent seaborn releases
sns.distplot(data["density"], ax=ax_hist, fit=stats.norm)
ax_hist.axvline(density_mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(density_median, color='g', linestyle='-', label='Median')
ax_hist.axvline(density_mode, color='b', linestyle='-', label='Mode')

# label the reference lines directly so the legend entries match them
ax_hist.legend()

ax_box.set(xlabel='')
plt.show()

At first glance the data looks as if it may be normally distributed, but the box plot of the same feature shows that it is heavily affected by outliers.
Let's check whether the feature is normally distributed using a normality test.

- H0 : the density feature is normally distributed
- H1 : H0 is not correct


If `p value < 0.05`, we reject the null hypothesis; otherwise we fail to reject it.
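
To show how such a decision rule works end to end, here is a small sketch of a normality test written in plain Python. Note that it is the Jarque-Bera test, not the D'Agostino-Pearson test that `stats.normaltest` implements; both combine skewness and kurtosis, but their statistics and p values differ. The two samples are synthetic.

```python
import math
import random

def jarque_bera(values):
    """Return (statistic, p_value) for the Jarque-Bera normality test."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    statistic = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under H0 the statistic is approximately chi-squared with 2 degrees of
    # freedom, whose survival function is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

rng = random.Random(0)
normal_like = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # synthetic
skewed = [rng.expovariate(1.0) for _ in range(2000)]       # synthetic

results = {}
for name, values in [("normal-like", normal_like), ("skewed", skewed)]:
    stat, p = jarque_bera(values)
    results[name] = (stat, p)
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name}: statistic={stat:.2f}, p={p:.4g} -> {decision}")
```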

In [None]:
normal = stats.normaltest(data['density'])
normal

- The p value is 2.1473202738102222e-07, which is far below 0.05, so we reject the null hypothesis.
- `density is not normally distributed`.

In [None]:
f, (ax_box, ax_dist) = plt.subplots(2, sharex=True, gridspec_kw = {"height_ratios":(0.2, 1)})

mean = np.mean(data['pH'])
median = np.median(data['pH'])
mode = data['pH'].mode()[0]

q1 = data['pH'].quantile(0.25)
q3 = data['pH'].quantile(0.75)
IQR = q3 - q1

print("Mean : {}".format(mean))
print("Median : {}".format(median))
print("Mode : {}".format(mode))
print("Interquartile Range : {}".format(IQR))

# pass the series as x so the box is drawn horizontally on the shared x-axis
sns.boxplot(x=data['pH'], ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='--')
ax_box.axvline(mode, color='b', linestyle='--')

# note: distplot is deprecated in recent seaborn releases
sns.distplot(data['pH'], ax=ax_dist, fit=stats.norm)
ax_dist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_dist.axvline(median, color='g', linestyle='--', label='Median')
ax_dist.axvline(mode, color='b', linestyle='--', label='Mode')

# label the reference lines directly so the legend entries match them
ax_dist.legend()

ax_box.set(xlabel='')
plt.show()

The mean and median of the pH feature are almost equal. Could pH be normally distributed? Let's see.

- H0 : pH is normally distributed
- H1 : H0 is not correct


In [None]:
normal = stats.normaltest(data['pH'])
normal

The p value is far below 0.05, which means we can reject the null hypothesis.
- pH is also not from a normal distribution

In [None]:
fig = plt.figure()
_ = stats.probplot(data['pH'], plot=plt)

The distribution is very close to normal: most points follow the red line. The blue points that fall off the red line are the departures from normality, which is consistent with the test above rejecting the hypothesis.
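
A probability plot pairs each sorted observation with the value a normal distribution would put at the same rank. The sketch below builds those pairs by hand. It uses simple `(i + 0.5) / n` plotting positions and a normal fitted to the sample, which are simplifying assumptions and not necessarily what `stats.probplot` does internally; the data is synthetic.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted value with the normal quantile expected at its rank."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n keeps the plotting positions strictly inside (0, 1).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

rng = random.Random(1)
values = [rng.gauss(3.3, 0.15) for _ in range(200)]  # synthetic pH-like data
points = qq_points(values)

# Largest vertical distance between a point and the line y = x.
max_gap = max(abs(expected - observed) for expected, observed in points)
print(f"largest deviation from the reference line: {max_gap:.4f}")
```

Points from a normal sample stay close to the line `y = x`; large gaps, usually at the two ends, signal heavy tails or skew.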

In [None]:
# let's look at the residual sugar

f, (ax_box, ax_dist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios":(0.2, 1)})

mean = np.mean(data['residual sugar'])
median = np.median(data['residual sugar'])
mode = data['residual sugar'].mode()[0]

sns.boxplot(x=data['residual sugar'], ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

sns.distplot(data['residual sugar'], ax=ax_dist, fit=stats.norm)
ax_dist.axvline(mean, color='r', linestyle='--')
ax_dist.axvline(median, color='g', linestyle='-')
ax_dist.axvline(mode, color='b', linestyle='-')

ax_box.set(xlabel='')

plt.show()

This feature is highly skewed, with a long tail of outliers; we have to treat features of this type.
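
One common way to treat a right-skewed, positive feature is a log transform. The notebook does not say which treatment it has in mind, so take the sketch below as one possible option, shown on synthetic values rather than on the real residual sugar column.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)
# Hypothetical right-skewed, strictly positive values.
raw = [rng.lognormvariate(0.8, 0.5) for _ in range(1000)]
# log1p(v) = log(1 + v) compresses the long right tail.
logged = [math.log1p(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```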

### Correlation

Now we will find out the dependence between features.

How does one feature change as another feature in our dataset changes? Measuring this is what correlation is about.

The most popular measure is the Pearson correlation coefficient, which captures linear relationships, but keep in mind that it is very sensitive to outliers.

We will use the Spearman correlation. It is the Pearson correlation computed on the ranks of the values, so it captures monotonic relationships and is much less sensitive to outliers.
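
The difference is easy to see on a tiny example. The sketch below implements both coefficients in plain Python (the helper names are my own) and applies them to invented data in which `y` always increases with `x` but the last point is an outlier.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 7, 9, 11, 12, 100]  # monotonic, but the last point is an outlier

print(f"Pearson : {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

The outlier pulls the Pearson coefficient well below 1, while the Spearman coefficient only sees that the ordering is perfectly preserved.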

In [None]:
corr_data = data.corr(method='spearman')
mask = np.zeros_like(corr_data, dtype=bool)  # np.bool was removed from NumPy; use the builtin bool
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corr_data, mask=mask, annot=True, square=True, linewidth=.1)
plt.show()

Correlation of each feature with quality (the dependent feature):

In [None]:
f, ax = plt.subplots(figsize=(25, 1))
quality = corr_data.sort_values(by=['quality'], ascending=False).head(1).T
quality = quality.sort_values(by=['quality'],ascending=False).T
sns.heatmap(quality, cmap=cmap, annot=True)
plt.show()

The correlations between quality and alcohol, and between fixed acidity and citric acid, are among the strongest in the heatmap. Let's look at the scatter of these pairs.


In [None]:
g = sns.lmplot(x='quality', y='alcohol', data=data, height=5, line_kws={'color':'red'})

In [None]:
g = sns.lmplot(x='fixed acidity', y='citric acid', data=data, height=5, line_kws={'color':'red'})

Both pairs of features show a roughly linear relationship.

## Inferential statistics

> * Sample mean and population mean
> * confidence intervals
> * hypothesis testing p-value and t-test

In [None]:
# Population mean of one of the features
# shape of data is (1599, 12)
pop_mean = data['pH'].mean()

# sample mean
sample_size = 500
np.random.seed(0)
sample = np.random.choice(a=data['pH'], size=sample_size)
sam_mean = np.mean(sample)

# printing mean of population and sample

print("Population mean of pH :{}".format(pop_mean))
print("Sample mean of pH :{}".format(sam_mean))

The population mean and the sample mean are almost the same. COOL!
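
How close the two means are depends on the sample size. As a rough illustration, the sketch below draws many samples of two different sizes from a synthetic population (a stand-in for the 1599 pH values, not the real data) and measures how much the sample mean varies.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical population standing in for the 1599 pH values.
population = [rng.gauss(3.31, 0.15) for _ in range(1599)]
pop_mean = statistics.mean(population)

def spread_of_sample_means(size, repeats=300):
    """Standard deviation of the sample mean across repeated random samples."""
    # choices() draws with replacement.
    means = [statistics.mean(rng.choices(population, k=size)) for _ in range(repeats)]
    return statistics.stdev(means)

spreads = {size: spread_of_sample_means(size) for size in (20, 500)}
for size, spread in spreads.items():
    print(f"n={size}: sample means vary by about {spread:.4f} around {pop_mean:.3f}")
```

Larger samples give sample means that cluster more tightly around the population mean, which is why a sample of 500 lands so close to it.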

### Confidence Intervals
Next, we will build confidence intervals and check whether the true mean lies inside them.

In [None]:
# confidence interval

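pop_std = data['pH'].std()
# ddof=1 gives the unbiased sample standard deviation
sam_std = np.std(sample, ddof=1)
# two-sided 95% interval: 2.5% in each tail, so use the 97.5th percentile
z_critical = stats.norm.ppf(q=0.975)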

print("Population mean (pop_mean) of pH :{}".format(pop_mean))
print('-'*70)
print("Sample mean (sam_mean) of pH:{}".format(sam_mean))
print('-'*70)
print("population standard deviation is (pop_std) of pH: {}".format(pop_std))
print('-'*70)
print("sample standard deviation (sam_std) of pH:{}".format(sam_std))
print('-'*70)
print("z critical value of pH : {}".format(z_critical))


In [None]:
from statsmodels.stats.weightstats import _zconfint_generic, _tconfint_generic

# both helpers expect the standard error of the mean (std / sqrt(n)),
# not the raw standard deviation

# if we know std for population
z_conf = _zconfint_generic(sam_mean,
                          pop_std / np.sqrt(sample_size),
                          0.05, 'two-sided')
print("95% confidence interval when we know population standard deviation", z_conf)

# if we know only sample std
t_conf = _tconfint_generic(sam_mean, sam_std / np.sqrt(sample_size),
                           sample_size - 1,
                           0.05, 'two-sided')
print("95% confidence interval when we only know sample standard deviation", t_conf)

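- The true (population) mean of pH is about 3.31
- Compare it with the intervals printed above: a 95% confidence interval is expected to contain the true mean for about 95% of random samples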
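
For reference, the z-based interval is simply the sample mean plus or minus the critical value times the standard error. The sketch below computes it by hand; `z_confidence_interval` is a hypothetical helper and the inputs are illustrative numbers, not the values printed by the cells above.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, pop_std, n, confidence=0.95):
    """Two-sided confidence interval for a mean when the population std is known."""
    alpha = 1.0 - confidence
    # Two-sided interval: alpha / 2 in each tail.
    z_critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Standard error of the mean, not the raw standard deviation.
    margin = z_critical * pop_std / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Illustrative inputs only.
low, high = z_confidence_interval(sample_mean=3.30, pop_std=0.15, n=500)
print(f"95% confidence interval: ({low:.4f}, {high:.4f})")
```

Because the margin shrinks with the square root of the sample size, quadrupling the sample halves the width of the interval.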

### Hypothesis Testing

Remember the hypothesis we formed while exploring the dependent feature, quality? Let's test it.

The z-test below compares the mean alcohol of wines with quality 5 against the mean alcohol of the whole dataset, so the null hypothesis has to be the "no difference" statement:

- H0 : The mean alcohol of wines with quality 5 is equal to the overall mean alcohol
- H1 : H0 is not correct

In [None]:
from statsmodels.stats.weightstats import ztest

z_stats, p_value = ztest(x1=data[data['quality']==5]['alcohol'], value=data['alcohol'].mean())
print("With respect to alcohol :")
z_stats, p_value

- The p value is far below 0.05, which means we can reject the null hypothesis
- The alcohol content of wines with quality 5 really is different from the overall mean

Let's see whether fixed acidity in wines with quality == 5 differs from the others (H0: the mean fixed acidity of wines with quality 5 is equal to the overall mean).

In [None]:
z_stats, p_value = ztest(x1=data[data['quality']==5]['fixed acidity'], value=data['fixed acidity'].mean())
print("With respect to fixed acidity :")
z_stats, p_value

In [None]:
0.011003242880424308 < 0.05

- The p value (about 0.011) is below 0.05, which means we can reject the null hypothesis
- The fixed acidity of wines with quality 5 is different from the overall mean
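
The tests above compare one group against the mean of the whole dataset, which includes that group. A more direct way to ask whether quality 5 differs "from the others" is a two-sample test of quality 5 against the remaining wines. The sketch below is a hand-written large-sample z-test with unpooled variances, which is an assumption of this example and differs from the pooled default of `ztest`; the two groups are synthetic stand-ins for the alcohol values.

```python
import math
import random
from statistics import NormalDist, mean, variance

def two_sample_ztest(a, b):
    """Two-sided z-test for a difference in means (large-sample approximation)."""
    # Unpooled standard error of the difference between the two means.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

rng = random.Random(7)
# Hypothetical alcohol values: one group for quality 5, one for the rest.
quality_5 = [rng.gauss(9.9, 0.7) for _ in range(600)]
others = [rng.gauss(10.8, 1.1) for _ in range(900)]

z, p = two_sample_ztest(quality_5, others)
decision = "reject H0" if p < 0.05 else "fail to reject H0"
print(f"z={z:.2f}, p={p:.3g} -> {decision} (H0: both groups have the same mean)")
```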
