# AIM:
To perform hypothesis testing using the t-test, z-test, p-value, and ANOVA.

## Hypothesis

### T-Test (Example: Comparing Alcohol Content for High vs. Low-Quality Wines):

Null Hypothesis (H₀): The mean alcohol content is the same for high-quality and low-quality wines.

Alternative Hypothesis (H₁): There is a significant difference in the mean alcohol content between high-quality and low-quality wines.


### Z-Test (Example: Testing pH Against a Standard Value):

Null Hypothesis (H₀): The mean pH of the wines is equal to a standard pH value of 3.

Alternative Hypothesis (H₁): The mean pH of the wines is significantly different from the standard pH value.

### P-Value (Example: Assessing the Impact of Chlorides on Wine Quality):

Null Hypothesis (H₀): There is no association between chlorides and wine quality.

Alternative Hypothesis (H₁): There is a significant association between chlorides and wine quality.

### ANOVA (Example: Comparing Mean Alcohol Content Across Different Wine Quality Ratings):

Null Hypothesis (H₀): The mean alcohol content is the same across all wine quality ratings.

Alternative Hypothesis (H₁): At least one wine quality rating has a different mean alcohol content.

In [16]:
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.stats import t

In [8]:
df = pd.read_csv('winequality-red.csv')
df

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |


In [9]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [10]:
columns_to_mean = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']

## Z-Score Test

In [11]:
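# Null hypothesis: mean pH equals the reference value null_mean.
# Note: the hypothesis section states a standard pH of 3, but this cell uses
# 3.31111, which is almost exactly the sample mean, so the z-statistic is near zero.
null_mean = 3.31111
# np.std defaults to ddof=0 (population formula); it stands in for the population standard deviation
population_std = np.std(df['pH'])
print(np.mean(df['pH']))
# Calculate z-statistic: (sample mean - null mean) / standard error
z_statistic = (np.mean(df['pH']) - null_mean) / (population_std / np.sqrt(len(df['pH'])))

# Critical value for a two-tailed test at the 5% significance level
alpha = 0.05
critical_value = norm.ppf(1 - alpha / 2)
print("Critical value:", critical_value)
print("Z-statistic:", z_statistic)
# Make a decision
if abs(z_statistic) > critical_value:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")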

3.3111131957473416
Critical value: 1.959963984540054
Z-statistic: 0.0008279865643324429
Fail to reject the null hypothesis
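
The critical-value comparison above can also be expressed as a two-sided p-value. The following is a minimal sketch, not the notebook's own method: `one_sample_z_test` is a hypothetical helper, it uses the sample standard deviation (`ddof=1`) as an assumption, and it runs on the nine pH values visible in the displayed rows rather than on the full dataset, with the null mean of 3 taken from the hypothesis statement.

```python
import numpy as np
from scipy.stats import norm

def one_sample_z_test(values, null_mean, alpha=0.05):
    """Two-sided one-sample z-test; returns (z, p_value, reject)."""
    values = np.asarray(values, dtype=float)
    # Assumption: the sample standard deviation (ddof=1) approximates the population value
    std_error = values.std(ddof=1) / np.sqrt(len(values))
    z = (values.mean() - null_mean) / std_error
    # Two-sided p-value: probability of a |z| at least this large under H0
    p_value = 2 * norm.sf(abs(z))
    return z, p_value, p_value < alpha

# pH values from the rows displayed earlier (a small subset, for illustration only)
sample_ph = [3.51, 3.20, 3.26, 3.16, 3.51, 3.45, 3.52, 3.42, 3.57]
z, p_value, reject = one_sample_z_test(sample_ph, null_mean=3.0)
print("Z-statistic:", z)
print("P-value:", p_value)
print("Reject H0:", reject)
```

With only nine observations a t-test would normally be the safer choice; the z-test is shown here only to mirror the cell above.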


## P-Value

In [12]:

# NOTE: despite the variable names, the two quantities below are column means,
# not proportions.
# Mean quality rating
observed_proportion = np.sum(df["quality"]) / len(df["quality"])

# Mean chloride concentration
expected_proportion = np.sum(df["chlorides"]) / len(df["chlorides"])

# Chi-square-style statistic: (observed - expected)^2 / expected
chi_square_statistic = ((observed_proportion - expected_proportion) ** 2) / expected_proportion

# Degrees of freedom (for a 1-sample proportion test); not used below
degrees_of_freedom = 1

# NOTE: 1 - statistic is not a valid p-value, because a p-value must lie in [0, 1].
# A p-value for a chi-square statistic comes from the survival function,
# e.g. scipy.stats.chi2.sf(chi_square_statistic, degrees_of_freedom).
p_value = 1 - chi_square_statistic
print("P-value:", p_value)
# Make a decision
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")


P-value: -350.9800008169624
Reject the null hypothesis
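
A negative value cannot be a probability, so the figure printed above should not be read as a p-value. One way to obtain a valid p-value for the stated hypothesis (association between chlorides and quality) is a correlation test. The sketch below uses `scipy.stats.pearsonr` as a swapped-in technique, not the notebook's original calculation; `association_p_value` is a hypothetical helper, and the data are the nine chloride and quality values visible in the displayed rows, not the full dataset.

```python
import numpy as np
from scipy.stats import pearsonr

def association_p_value(x, y, alpha=0.05):
    """Pearson correlation test; returns (r, p_value, reject)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # pearsonr returns the correlation coefficient and a two-sided p-value
    r, p_value = pearsonr(x, y)
    return r, p_value, p_value < alpha

# Values from the rows displayed earlier (illustration only)
chlorides = [0.076, 0.098, 0.092, 0.075, 0.076, 0.090, 0.062, 0.076, 0.075]
quality = [5, 5, 5, 6, 5, 5, 6, 6, 5]

r, p_value, reject = association_p_value(chlorides, quality)
print("Correlation:", r)
print("P-value:", p_value)
print("Reject H0:", reject)
```

Because quality is an ordinal rating, a rank-based test such as `scipy.stats.spearmanr` may be a better fit; the call pattern is the same.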


In [18]:
threshold = 5  # Example threshold; adjust it to your own definition of "high quality"

# Create a new column 'Quality': 'High' if quality >= threshold, otherwise 'Low'
df['Quality'] = df['quality'].apply(lambda x: 'High' if x >= threshold else 'Low')

# Display the original rating next to the new 'Quality' label
print(df[['quality', 'Quality']])
df['Quality'].value_counts()

      quality Quality
0           5    High
1           5    High
2           5    High
3           6    High
4           5    High
...       ...     ...
1594        5    High
1595        6    High
1596        6    High
1597        5    High
1598        6    High

[1599 rows x 2 columns]


Quality
High    1536
Low       63
Name: count, dtype: int64

## T Test

In [22]:
# Example data: alcohol content for high and low-quality wines
high_quality_alcohol = df[df['Quality'] == 'High']['alcohol']
low_quality_alcohol = df[df['Quality'] == 'Low']['alcohol']

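# Calculate t-statistic (pooled-variance, i.e. Student's two-sample t-test)
mean_diff = np.mean(high_quality_alcohol) - np.mean(low_quality_alcohol)
n1, n2 = len(high_quality_alcohol), len(low_quality_alcohol)
# s1 and s2 hold the sample variances (ddof=1), not standard deviations
s1, s2 = np.var(high_quality_alcohol, ddof=1), np.var(low_quality_alcohol, ddof=1)

# Pooled variance assumes both groups share the same population variance
pooled_var = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)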

# Check if the denominator is not zero before calculating t-statistic
if pooled_var * (1/n1 + 1/n2) != 0:
    t_statistic = mean_diff / np.sqrt(pooled_var * (1/n1 + 1/n2))
    
    # Degrees of freedom
    degrees_of_freedom = n1 + n2 - 2

    # Critical value for a two-tailed test at 95% confidence level
    alpha = 0.05
    critical_value = t.ppf(1 - alpha / 2, degrees_of_freedom)

    # Make a decision
    if abs(t_statistic) > critical_value:
        print("Reject the null hypothesis")
    else:
        print("Fail to reject the null hypothesis")
else:
    print("Error: Denominator is zero. Cannot calculate t-statistic.")


Fail to reject the null hypothesis
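
The manual pooled-variance calculation can be cross-checked against `scipy.stats.ttest_ind`, which also reports a p-value. The sketch below is illustrative only: `pooled_t_statistic` is a hypothetical helper that repeats the formula used above, and the two groups are the alcohol values from the displayed rows, split at quality >= 6 purely so that both groups are non-trivial (the notebook itself uses a threshold of 5). With groups as unbalanced as 1536 versus 63, Welch's test (`equal_var=False`) is a common alternative because it does not assume equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

def pooled_t_statistic(a, b):
    """Student's two-sample t-statistic with pooled variance."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Alcohol values from the displayed rows; the split is hypothetical
group_a = [9.8, 11.2, 11.0]                  # quality >= 6
group_b = [9.4, 9.8, 9.8, 9.4, 10.5, 10.2]   # quality < 6

manual_t = pooled_t_statistic(group_a, group_b)
student = ttest_ind(group_a, group_b, equal_var=True)   # same pooled-variance test
welch = ttest_ind(group_a, group_b, equal_var=False)    # Welch's unequal-variance test

print("Manual t:", manual_t)
print("SciPy Student t:", student.statistic, "p-value:", student.pvalue)
print("SciPy Welch t:", welch.statistic, "p-value:", welch.pvalue)
```

The manual statistic and the `equal_var=True` statistic should agree up to floating-point error, which makes this a quick sanity check on the hand-written formula.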


## ANOVA

In [23]:
from scipy.stats import f

data = pd.DataFrame({'alcohol': df['alcohol'], 'quality': df['quality']})

# One array of alcohol values per quality rating (every rating present in the data)
groups = [group.to_numpy() for _, group in data.groupby('quality')['alcohol']]

overall_mean = np.mean(data['alcohol'])

# Between-group sum of squares, degrees of freedom, and mean square
ssb = sum(len(group) * (np.mean(group) - overall_mean)**2 for group in groups)
dfb = len(groups) - 1
msb = ssb / dfb

# Within-group sum of squares: deviations from each group's own mean
ssw = sum(np.sum((group - np.mean(group))**2) for group in groups)
dfw = len(data['alcohol']) - len(groups)
msw = ssw / dfw

f_statistic = msb / msw

f_dof_between = dfb
f_dof_within = dfw

# Critical value for a significance level of 0.05, from the F distribution
alpha = 0.05
critical_value = f.ppf(1 - alpha, f_dof_between, f_dof_within)

if f_statistic > critical_value:
    print("Reject the null hypothesis (There is a significant difference between group means)")
else:
    print("Fail to reject the null hypothesis (No significant difference between group means)")


Reject the null hypothesis (There is a significant difference between group means)
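
As a cross-check on a hand-computed F-statistic, `scipy.stats.f_oneway` performs the same one-way ANOVA and returns a p-value. The sketch below is a minimal illustration under stated assumptions: `one_way_anova` is a hypothetical helper, and the three groups are made-up alcohol values, not taken from the wine dataset.

```python
import numpy as np
from scipy.stats import f, f_oneway

def one_way_anova(groups):
    """Manual one-way ANOVA; returns (f_statistic, p_value)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group variation: group means around the grand mean
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group variation: values around their own group mean
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    dfb = len(groups) - 1
    dfw = len(all_values) - len(groups)
    f_statistic = (ssb / dfb) / (ssw / dfw)
    p_value = f.sf(f_statistic, dfb, dfw)
    return f_statistic, p_value

# Hypothetical alcohol values for three quality ratings (illustration only)
example_groups = [
    [9.4, 9.8, 9.8, 9.4],
    [9.8, 11.2, 11.0, 10.5],
    [12.1, 12.5, 11.9, 12.8],
]

manual_f, manual_p = one_way_anova(example_groups)
scipy_result = f_oneway(*example_groups)
print("Manual F:", manual_f, "p-value:", manual_p)
print("SciPy F:", scipy_result.statistic, "p-value:", scipy_result.pvalue)
```

The two F-statistics should match up to floating-point error; a mismatch usually points to an error in the within-group sum of squares or the degrees of freedom.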
