# Chemsha Bongo: Hypothesis Testing

Using the ```Wine Quality Dataset```, test the claim that wines with a higher quality rating have a higher median alcohol content than wines with a lower quality rating.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import ttest_ind
from scipy.stats import median_test

# Scientific Question

##### What is the relationship between wine quality rating and median alcohol content in the Wine Quality Dataset, and is there evidence to support the claim that wines with a higher quality rating have a higher median alcohol content than wines with a lower quality rating?

### Data Understanding

#### Reading the data

In [None]:
wine_df = pd.read_csv('Data/winequality-white.csv', sep=';')
wine_df

### Selecting the required columns

In [None]:
alcohol_df = wine_df[['alcohol']]
quality_df = wine_df[['quality']]

print('Alcohol column:', alcohol_df)
print('Quality column:', quality_df)

In [None]:
wine_df.alcohol.unique()

In [None]:
# Checking distribution of the values

sns.displot(quality_df, kde=True, stat='density');

sns.displot(alcohol_df, kde=True, stat='density', bins=5);

In [None]:
# Statistical summary of relevant columns
print('summary of alcohol content:', alcohol_df.describe())

print('summary of quality:', quality_df.describe())

### Resampling the data

In [None]:
sample = wine_df.sample(n=300, random_state=42)
sample

In [None]:
# checking for distribution pattern
sns.displot(sample['alcohol'], kde=True, stat='density')
sns.displot(sample['quality'], kde=True, stat='density');

In [None]:
high_quality = sample.loc[sample['quality'] >= 6,['alcohol','quality']]

low_quality = sample.loc[sample['quality'] <= 5, ['alcohol','quality']]

# drop rows with missing values from each group
high_quality = high_quality.dropna()
low_quality = low_quality.dropna()

# cast both columns to a floating-point type
high_quality = high_quality.astype('float')
low_quality = low_quality.astype('float')

# check that every sampled row falls into exactly one of the two groups
num_samples = sample.shape[0]
num_samples == low_quality['quality'].count() + high_quality['quality'].count()

In [None]:
sns.boxplot(data=low_quality.values );

In [None]:
sns.boxplot(data=high_quality.values );

In [None]:
# overall median of the alcohol content (full dataset)
median_alcohol_content = wine_df.alcohol.median()
print('Overall median of the alcohol content:', median_alcohol_content)

# overall mean of the alcohol content (full dataset)
mean_alcohol_content = wine_df.alcohol.mean()
print('Overall mean of the alcohol content:', mean_alcohol_content)

## Bartlett's test 

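Bartlett's test checks whether two or more groups have equal variances. Its null hypothesis is that all group variances are equal, so a small p-value indicates that at least one variance differs. The result determines whether a test that assumes equal variances is appropriate for comparing the groups. Note that Bartlett's test is sensitive to departures from normality.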
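
As a minimal sketch of how the test behaves, the example below runs `scipy.stats.bartlett` on two synthetic groups that share a mean but differ in spread. The data, the group names and the seed are illustrative assumptions, not values from the wine dataset. Levene's test (`scipy.stats.levene`) is shown next to it as one commonly used alternative that is less sensitive to non-normal data.

```python
import numpy as np
from scipy.stats import bartlett, levene

# Synthetic, illustrative data (an assumption, not the wine dataset):
# both groups share the same mean but have different standard deviations.
rng = np.random.default_rng(0)
group_narrow = rng.normal(loc=10.0, scale=0.5, size=200)
group_wide = rng.normal(loc=10.0, scale=1.5, size=200)

# Null hypothesis for both tests: the group variances are equal.
bartlett_stat, bartlett_p = bartlett(group_narrow, group_wide)
levene_stat, levene_p = levene(group_narrow, group_wide, center='median')

print(f"Bartlett: statistic = {bartlett_stat:.3f}, p-value = {bartlett_p:.3g}")
print(f"Levene:   statistic = {levene_stat:.3f}, p-value = {levene_p:.3g}")
```

A small p-value from either test suggests the variances differ, which argues against a pooled-variance t-test.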

In [None]:
# import bartletts test 
from scipy.stats import bartlett

# subsetting the data 
high_quality_a = high_quality['alcohol']
low_quality_a = low_quality['alcohol']

# Bartlett's test on the alcohol content of the two quality groups
stat, p = bartlett(high_quality_a, low_quality_a)

# display the results
print("Bartlett's test statistic = ", stat)
print(f"p-value =  {p:.20f}")

Bartlett's test gave a test statistic of 19.7 and a p-value of 0.000009. Since the p-value is far below 0.05, we reject the hypothesis of equal variances: the alcohol content of high-quality and low-quality wines has significantly different variances.
Because the equal-variance assumption does not hold, the groups should be compared with a test that does not rely on it.
One such test is Welch's t-test, applied below:


In [None]:
# Alcohol content statistics
# sample means
high_a_mean = high_quality_a.mean()
low_a_mean = low_quality_a.mean()

print("Mean alcohol content (high quality):", high_a_mean)
print("Mean alcohol content (low quality):", low_a_mean)

# sample variances
high_a_var = high_quality_a.var()
low_a_var = low_quality_a.var()

print("Variance of alcohol content (high quality):", high_a_var)
print("Variance of alcohol content (low quality):", low_a_var)

# length of Sample
high_alcohol_len = len(high_quality_a)
low_alcohol_len = len(low_quality_a)

print("The length for high quality is:", high_alcohol_len)
print("The length for low quality is:", low_alcohol_len)

In [None]:
# Welch's t-test for alcohol content
t, p_val = stats.ttest_ind(high_quality_a, low_quality_a, equal_var=False)

print("The t-statistic for the Welch test is:", t)
print("The p-value for the Welch test is:", p_val)

In [None]:
alpha = 0.05
if p_val < alpha:  # p_val is a scalar, so compare it with alpha directly
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

Welch's t-test was used to compare the mean alcohol content of high-quality and low-quality wines because it does not assume equal variances. The t-statistic was 4.4035. Significance is judged from the p-value printed above, which is compared against alpha = 0.05. The positive sign of the statistic shows that the sample mean for high-quality wines is the larger of the two, and a value of this size is strong evidence that the mean alcohol content differs between the groups.
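
To make the statistic concrete, the sketch below computes Welch's t from group means, variances and sizes, then checks it against `scipy.stats.ttest_ind(..., equal_var=False)`. The two samples are synthetic and all variable names are illustrative assumptions. The formulas used are the standard Welch statistic $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$ and the Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative samples (an assumption, not the wine data).
rng = np.random.default_rng(1)
sample_a = rng.normal(loc=10.8, scale=1.2, size=180)
sample_b = rng.normal(loc=10.0, scale=0.8, size=120)

# Sample means, unbiased variances (ddof=1, matching pandas' default) and sizes.
mean_a, mean_b = sample_a.mean(), sample_b.mean()
var_a, var_b = sample_a.var(ddof=1), sample_b.var(ddof=1)
n_a, n_b = len(sample_a), len(sample_b)

# Welch's t-statistic: difference in means over the unpooled standard error.
se2_a, se2_b = var_a / n_a, var_b / n_b
t_manual = (mean_a - mean_b) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

# Reference result from SciPy.
t_scipy, p_scipy = stats.ttest_ind(sample_a, sample_b, equal_var=False)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The manual and SciPy values should agree up to floating-point rounding.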

## Define the Hypothesis

##### Null Hypothesis (H0): There is no difference in median alcohol content between wines with a higher quality rating and wines with a lower quality rating.

##### Alternative Hypothesis (Ha): Wines with a higher quality rating have a higher median alcohol content than wines with a lower quality rating.

## Statistical Test

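To test the claim that wines with a higher quality rating have a higher alcohol content than wines with a lower quality rating, the cell below uses the two-sample t-test. This is a parametric test that compares means, not medians: its null hypothesis is that the two groups have the same mean, and its alternative hypothesis is that the means differ. Since the claim is stated in terms of medians, a non-parametric test such as Mood's median test (`median_test`, imported at the top of the notebook) matches the hypotheses more directly, and the t-test result should be read as evidence about the means.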
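
For a comparison that targets medians more directly, here is a minimal sketch using two non-parametric tests from SciPy. The two groups are synthetic stand-ins for the high- and low-quality alcohol values, and all names are illustrative assumptions. Mood's median test (`median_test`) is two-sided and tests whether the groups share a common median. The Mann-Whitney U test with `alternative='greater'` is one-sided, but strictly speaking it tests whether values in the first group tend to be larger, which is not exactly a statement about medians.

```python
import numpy as np
from scipy.stats import median_test, mannwhitneyu

# Synthetic, illustrative groups (an assumption, not the wine data).
rng = np.random.default_rng(2)
high_group = rng.normal(loc=10.9, scale=1.2, size=200)
low_group = rng.normal(loc=9.9, scale=0.9, size=100)

# Mood's median test: null hypothesis is that both groups share the same median.
# `table` counts the values above and below the grand median in each group.
chi2_stat, p_median, grand_median, table = median_test(high_group, low_group)
print("Mood's median test:", chi2_stat, p_median)
print("Grand median:", grand_median)
print("Contingency table:\n", table)

# Mann-Whitney U test, one-sided: are values in high_group stochastically greater?
u_stat, p_u = mannwhitneyu(high_group, low_group, alternative='greater')
print("Mann-Whitney U:", u_stat, p_u)
```

In the notebook, the same calls could be applied to `high_quality['alcohol']` and `low_quality['alcohol']`.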


In [None]:
# Perform a two-sample t-test to compare the means of the alcohol content for the two groups
stat, p_value = ttest_ind(high_quality['alcohol'], low_quality['alcohol'])
print('Two-sample t-test:')
print('Statistic:', stat)
print(f"p-value :  {p_value:.20f}")


In [None]:
alpha = 0.05
if p_value < alpha:  # p_value is a scalar, so compare it with alpha directly
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')

## Findings

At a significance level of 0.05, we reject the null hypothesis and conclude that the mean alcohol content differs significantly between high-quality and low-quality wines. Specifically, high-quality wines have a significantly higher mean alcohol content than low-quality wines. Because the t-test compares means, this result supports the claim about medians only indirectly.
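
The directional claim can also be tested with a one-sided test. The sketch below assumes SciPy 1.6 or later, where `ttest_ind` accepts an `alternative` argument, and uses synthetic groups with illustrative names. It compares the two-sided Welch test with its one-sided counterpart.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic, illustrative groups (an assumption, not the wine data).
rng = np.random.default_rng(3)
high_group = rng.normal(loc=10.9, scale=1.2, size=200)
low_group = rng.normal(loc=9.9, scale=0.9, size=100)

# Two-sided Welch test: alternative hypothesis is that the means differ.
t_two, p_two = ttest_ind(high_group, low_group, equal_var=False)

# One-sided Welch test: alternative hypothesis is mean(high_group) > mean(low_group).
t_one, p_one = ttest_ind(high_group, low_group, equal_var=False, alternative='greater')

print("two-sided:", t_two, p_two)
print("one-sided:", t_one, p_one)
```

When the t-statistic is positive, the one-sided p-value is half the two-sided one, so the one-sided test is the more direct check of a "higher than" claim.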